Background of the Invention

1. Cross Reference to Related Applications. 

This application claims the benefit under 35 U.S.C. §119(e) to U.S. Provisional
Patent Application Serial No. 60/155,321 entitled "4-KBITS/S SPEECH CODING,"
(Attorney Docket No. 99RSS485P), filed September 22, 1999; and is a
continuation-in-part of United States Patent Application Serial Number 09/574,396
(Attorney Docket No. 246/258), "A NEW SPEECH GAIN QUANTIZATION STRATEGY,"
filed May 19, 2000, and is now United States Patent Number , both of which are
incorporated by reference in their entirety.

The following commonly assigned U.S. patents and co-pending and commonly 
assigned U.S. patent applications further describe other aspects of the embodiments 
disclosed in this application and are incorporated by reference in their entirety. 

United States Patent Number 5,689,615, "USAGE OF VOICE ACTIVITY 
DETECTION FOR EFFICIENT CODING OF SPEECH," issued November 18, 1997. 

United States Patent Number 5,774,839, "DELAYED DECISION SWITCHED 
PREDICTION MULTI-STATE LSF VECTOR QUANTIZATION," issued June 30, 
1998.

United States Patent Number 6,104,992, "ADAPTIVE GAIN
REDUCTION TO PRODUCE FIXED CODEBOOK TARGET SIGNAL," issued August
15, 2000.

United States Patent Application Serial Number 09/156,649 (Attorney Docket
No. 95E020), "COMB CODEBOOK STRUCTURE," filed September 18, 1998, and is
now United States Patent Number .

United States Patent Application Serial Number 09/365,444 (Attorney Docket 
No. 97RSS380), "BI-DIRECTIONAL PITCH ENHANCEMENT IN SPEECH CODING 
SYSTEMS," filed August 2, 1999, and is now United States Patent Number . 

United States Patent Application Serial Number 09/156,814 (Attorney Docket 
No. 98RSS365), "COMPLETED FIXED CODEBOOK FOR SPEECH ENCODER," 
filed September 18, 1998, and is now United States Patent Number . 

United States Patent Application Serial Number , "SYSTEM FOR
AN ADAPTIVE EXCITATION PATTERN FOR SPEECH CODING," Attorney
Reference Number: 98RSS366 (10508.9), filed on September 15, 2000, and is now
United States Patent Number .

United States Patent Application Serial Number 09/574,396 (Attorney Docket No. 
99RSS312), "COMPLETED FIXED CODEBOOK FOR SPEECH ENCODER," filed 
May 19, 2000, and is now United States Patent Number . 

United States Patent Application Serial Number 09/154,660 (Attorney Docket No. 
98RSS384), "SPEECH ENCODER ADAPTIVELY PITCH PREPROCESSING WITH 
CONTINUOUS WARPING," filed September 18, 1998, and is now United States Patent 
Number . 

United States Patent Application Serial Number 09/154,662 (Attorney Docket No. 
98RSS383), "SPEECH CLASSIFICATION AND PARAMETER WEIGHTING USED 
IN CODEBOOK SEARCH," filed September 18, 1998, and is now United States Patent 
Number . 

United States Patent Application Serial Number 09/154,675 (Attorney Docket No.
97RSS383), "SPEECH ENCODER USING CONTINUOUS WARPING IN LONG
TERM PREPROCESSING," filed September 18, 1998, and is now United States Patent
Number .

United States Patent Application Serial Number 09/154,654 (Attorney Docket No.
98RSS344), "PITCH DETERMINATION USING SPEECH CLASSIFICATION AND
PRIOR PITCH ESTIMATION," filed September 18, 1998, and is now United States
Patent Number .

United States Patent Application Serial Number 09/156,650 (Attorney Docket No.
98RSS343), "SPEECH ENCODER USING GAIN NORMALIZATION THAT
COMBINES OPEN AND CLOSED LOOP GAINS," filed September 18, 1998, and is
now United States Patent Number .

United States Patent Application Serial Number 09/154,657 (Attorney Docket No.
98RSS328), "SPEECH ENCODER USING A CLASSIFIER FOR SMOOTHING
NOISE CODING," filed September 18, 1998, and is now United States Patent Number .


United States Patent Application Serial Number (Attorney Docket
No. 99RSS227), "METHOD FOR SPEECH CODING USING SNR," filed August 16,
2000, and is now United States Patent Number .

United States Patent Application Serial Number (Attorney Docket
No. 99RSS219), "METHOD FOR ROBUST CLASSIFICATION IN SPEECH
CODING," filed August 21, 2000, and is now United States Patent Number .

United States Patent Application Serial Number 09/156,648 (Attorney Docket No.
98RSS228), "LOW COMPLEXITY RANDOM CODEBOOK STRUCTURE," filed
September 18, 1998, and is now United States Patent Number .

United States Patent Application Serial Number 09/156,416 (Attorney Docket No.
98RSS011), "METHOD AND APPARATUS FOR DETECTING VOICE ACTIVITY
AND SILENCE IN A SPEECH SIGNAL USING PITCH LAG AND PITCH GAIN
STATISTICS," filed September 18, 1998, and is now United States Patent Number .



United States Patent Application Serial Number 09/154,653 (Attorney Docket No.
97RSS383), "SYNCHRONIZED ENCODER-DECODER FRAME CONCEALMENT
USING SPEECH CODING PARAMETERS," filed September 18, 1998, and is now
United States Patent Number .

United States Patent Application Serial Number 09/156,826 (Attorney Docket No.
98RSS382), "Adaptive Tilt Compensation For Synthesized Speech Residual," filed
September 18, 1998, and is now United States Patent Number .

2. Field of the Invention. 

The present invention relates to speech coding, and more particularly, to speech 
coding systems that operate at a bit rate of 4 kbits/s. 

3. Related Art 

Speech coding systems may not operate effectively at low bit rates. When a small
bandwidth is available to encode speech, the perceptual quality of encoded speech
declines dramatically. Because of the increased use of wireless communication, there is
an effort to reduce the bandwidth upon which such wireless communication systems
operate.

To efficiently decrease the wireless bandwidth but still retain toll quality, a
speech coding system generally performs a strict waveform matching. Waveform
matching as employed in a low bit rate wireless coding system, such as 4 kbits/s,
however, may not perceptually or accurately capture the speech information. Therefore,
there is a need in the art for a speech coding system that provides high
perceptual quality at a low bit rate.

Summary

The invention is a system that improves the quality of encoding and decoding by
focusing on the perceptually important characteristics of speech. The system analyzes
features of an input speech signal and performs a common frame based speech coding of
the input speech signal. The system then performs a speech coding based on either a first
speech coding mode or a second speech coding mode. The selection of a mode is based
on at least one feature of the input speech signal. The first speech coding mode uses a
first framing structure and the second speech coding mode uses a second framing
structure.

The system may operate at approximately 4 kbits/s. The first framing structure
and the second framing structure both use eighty bits. In this system, eighty bits are
transmitted from an encoder to a decoder. The decoder receives the bits from the encoder
and reconstructs the speech. Twenty-one bits are allocated to code the
linear prediction coefficients. One bit is allocated to the speech coding mode. Fourteen
bits are allocated to the adaptive codebook in the first framing structure and seven bits
are allocated to the adaptive codebook in the second framing structure. In the first
framing structure, thirty bits are used to code a first fixed codebook. In the second
framing structure, thirty-nine bits are used to code a second fixed codebook.

The first speech coding mode uses a two dimensional vector quantization gain
codebook and a two dimensional code-vector. The two dimensional code-vector is
selected from the two dimensional vector quantization gain codebook that has an adaptive
codebook gain and a fixed codebook gain. Fourteen bits are allocated to the vector
quantization gain codebook.

The second speech coding mode uses two three dimensional vector quantization
gain codebooks. A first three dimensional code-vector is selected from a first three
dimensional vector quantization gain codebook that has an adaptive codebook gain. A
second three dimensional code-vector is selected from a second three dimensional vector
quantization gain codebook that has a fixed codebook gain. Four bits are allocated to the
first three dimensional vector quantization gain codebook. Eight bits are allocated to the
second three dimensional vector quantization gain codebook. The first speech coding
mode and the second speech coding mode operate at the same bit rate.

In another aspect, the system dynamically decodes an encoded speech signal by
selecting between a first speech decoding mode and a second speech decoding mode
according to the mode bit transmitted from the encoder to the decoder. Many
characteristics of the input signal may be used to make this selection. The characteristics
can include the degree of noise-like content in the input speech signal, the degree of
spike content in the input speech signal, the degree of voiced content in the input speech
signal, the degree of unvoiced content in the input speech signal, the change in the
magnitude spectrum of the input speech signal, the change of the energy contour of the
input speech signal, and the level of periodicity in the input speech signal.

Another aspect of the invention analyzes an input signal by dynamically selecting
between a first speech coding mode and a second speech coding mode based on select
features of the input signal. The system codes a first frame of the input speech signal
using the first speech coding mode and codes a second frame of the input speech signal
using the second speech coding mode. The number of bits allocated to the first speech
coding mode may be based on the parameters used to code the input speech signal, just as
the number of bits allocated to the second speech coding mode may be based on the
number of parameters used to code the input speech signal. The total number of bits used
for each speech coding mode can also be the same.

In another aspect of the invention, the extended code excited linear prediction
speech coding system operates at approximately 4 kbits/s. The number of bits used for
the speech coding modes is eighty bits. The first speech coding mode uses a first framing
structure and the second speech coding mode uses a second framing structure. The first
framing structure codes a first fixed codebook using thirty bits and the second framing
structure codes a second fixed codebook using thirty-nine bits. The second framing
structure codes the adaptive codebook gains using four bits and codes the fixed codebook
gains using eight bits. The system performs dynamic gain quantization in the first speech
coding mode and the second speech coding mode, respectively.

Another aspect of the invention analyzes an input signal by its features, performs
a common frame based speech coding, and also performs a mode dependent speech
coding using either a first speech coding mode or a second speech coding mode. In this
aspect, at least one of the features of the input signal is a substantially non-periodic-like
characteristic, and at least one of the other features of the input signal is a
substantially periodic-like characteristic.

In yet another aspect of the invention, the system uses a two dimensional vector
quantization gain codebook and a two dimensional code-vector at approximately 4
kbits/s. The two dimensional code-vector is selected from the two dimensional vector
quantization gain codebook that has an adaptive codebook gain and a fixed codebook
gain. The second speech coding mode uses a first three dimensional vector quantization
gain codebook and a second three dimensional vector quantization gain codebook. A
first three dimensional code-vector selected from the first three dimensional vector
quantization gain codebook has an adaptive codebook gain. The second three
dimensional code-vector selected from the second three dimensional vector quantization
gain codebook has a fixed codebook gain. The system decodes the encoded speech
signal, and the decoding system dynamically selects the speech coding mode used to
decode the speech signal. The system determines whether a received frame of the
encoded speech signal is defective, and performs a frame erasure when a defective
received frame is detected.

Other systems, methods, features and advantages of the invention will be or will
become apparent to one with skill in the art upon examination of the following figures
and detailed description. It is intended that all such additional systems, methods, features
and advantages be included within this description, be within the scope of the invention,
and be protected by the accompanying claims.

Brief Description of the Figures 

The components in the figures are not necessarily to scale, emphasis instead being 
placed upon illustrating the principles of the invention. Moreover, in the figures, like 
reference numerals designate corresponding parts throughout the different views. 

Fig. 1 is a system diagram of a speech coding system performing signal pre-processing.
Fig. 2 is a graph of noise level attenuation by the speech coding system.
Fig. 3 is a block diagram of a common frame based system.
Fig. 4 is a block diagram of a Mode zero speech coding system.
Fig. 5 is a graph of a forward-backward pitch enhancement.
Fig. 6 is a block diagram of a Mode one speech coding system.
Fig. 7 is a block diagram of a decoder.

Detailed Description 

The system employs an extended Code Excited Linear Prediction System
(extended CELP) that is based on a Code Excited Linear Prediction System (CELPS)
that performs speech coding. To achieve toll quality at a low bit rate, such as 4 kbits/s,
the system puts emphasis on the perceptually important features of an input speech signal
during the encoding process. This occurs by analyzing certain features of the input
speech signal, such as the degree of noise-like content, the degree of spike-like content,
the degree of voiced content, the degree of unvoiced content, the change in the magnitude
spectrum, the change in the energy contour, and the level of periodicity, for example.
The system uses this information to control a weighting during an encoding/quantization
process. The system represents accurately the perceptually important features of a speech
signal, while allowing errors in the perceptually less important features. This is based on
the observation that 4 kbits/s is not sufficient to accurately represent the waveform of the
input signal. In some sense, the system has to prioritize. For example, for a random-like
signal, the system disregards the accuracy in the waveform matching to some extent and
encourages the selection of the fixed codebook excitation from a Gaussian codebook.
The system modifies the waveform of the input signal while leaving it perceptually
indistinguishable in order to allow the model to more accurately represent the input
signal.

The system operates on a frame size of approximately 20 ms (or about 160
samples) using either two or three subframes. The number of subframes is controlled by
a mode selection. Mode zero ("0") uses two subframes and Mode one ("1") uses three
subframes. For a Mode 0 the subframe size is approximately 10 ms (or about 80
samples), and in a Mode 1 the first and the second subframes are approximately 6.625 ms
(or about 53 samples) and the third subframe is approximately 6.75 ms (or about 54
samples). In both Mode 1 and Mode 0, a look-ahead of approximately 15 ms is used.
The one-way coding delay of the system adds up to approximately 55 ms according to the
delay definition in the terms of reference.
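
The frame layout described above can be checked with a short sketch. The 8 kHz sampling rate is inferred from the stated 160 samples per 20 ms frame; the constant and function names are illustrative assumptions and do not come from the specification.

```python
# Illustrative sketch: subframe layout per mode, assuming 8 kHz sampling
# (160 samples per 20 ms frame). All names here are hypothetical.
SAMPLE_RATE_HZ = 8000
FRAME_SAMPLES = 160

SUBFRAME_SAMPLES = {
    0: (80, 80),      # Mode 0: two subframes of about 10 ms each
    1: (53, 53, 54),  # Mode 1: three subframes
}


def subframe_durations_ms(mode):
    """Return the duration in milliseconds of each subframe for a mode."""
    return [1000.0 * n / SAMPLE_RATE_HZ for n in SUBFRAME_SAMPLES[mode]]


for mode, sizes in SUBFRAME_SAMPLES.items():
    # Each mode must cover the full frame exactly.
    assert sum(sizes) == FRAME_SAMPLES
    print(mode, sizes, subframe_durations_ms(mode))
```

Both layouts cover the same 160 sample frame, which is what allows the mode to be switched on a frame-by-frame basis.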

For both Mode 0 and Mode 1, a 10th order LP (Linear Prediction) model is used to
represent the spectral envelope of the signal. The 10th order LP model is coded in the
LSF (Line Spectrum Frequency) domain using a 21 bit delayed decision switched
multi-stage predictive vector quantization scheme. One bit specifies one of two MA
(Moving Average) predictors, and three stages (each with a 10 dimensional codebook)
of 7 bits, 7 bits, and 6 bits, respectively, are used to represent the prediction error.

Preferably, Mode 0 processes "non-periodic" frames. Examples of non-periodic
frames may include transition frames where the typical parameters such as pitch
correlation and pitch lag change rapidly or frames where the signal is dominantly
noise-like. Mode 0 uses two subframes and codes the pitch lag once per subframe, and
has a 2-dimensional vector quantizer of 7 bits that jointly codes the pitch gain and the
fixed codebook gain once per subframe. Preferably, the fixed codebook includes at least
three sub-codebooks, where two of the fixed sub-codebooks are pulse codebooks and the
third sub-codebook is a Gaussian sub-codebook. In this embodiment, the pulse codebooks
are a two-pulse sub-codebook and a three-pulse sub-codebook. Preferably, the Gaussian
sub-codebook has two orthogonal basis vectors each having a dimension of 40, which
lowers the complexity of the Gaussian sub-codebook search. The number of entries in the
sub-codebooks may be 2^14, 2^13, and 2^13, respectively. Accordingly, 15 bits per
subframe may be allocated to the fixed codebook in Mode 0.

Preferably, Mode 1 processes "periodic" frames. Highly periodic frames can be
perceptually well represented with a smooth pitch track. In Mode 1, a frame can be
broken into three subframes. The pitch lag is coded once per frame prior to a subframe
processing, which is part of the pitch pre-processing. An interpolated pitch track is
derived from the pitch lag. In Mode 1, three pitch gains (one from each subframe)
exhibit a very stable behavior and can be jointly quantized using vector quantization in an
open-loop MSE fashion using 4 bits prior to a subframe processing. The three reference
pitch gains, which are unquantized pitch gains, are derived from the weighted speech and
are a product of the frame based pitch pre-processing. Using pre-quantized pitch gains,
the traditional CELP subframe processing is performed while the three fixed codebook
gains are left unquantized. The three fixed codebook gains are jointly quantized with an
8-bit vector quantizer after subframe processing (a delayed decision) using a moving
average (MA) prediction of the energy. Thereafter, the three subframes are synthesized
with fully quantized parameters to update filter memories. During a traditional CELP
subframe process, the fixed codebook excitation is quantized with 13 bits per subframe.
The codebook has three pulse sub-codebooks with 2^12, 2^11, and 2^11 entries, respectively,
and the number of pulses in the sub-codebooks are 2, 3, and 6, respectively.
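
The link between the sub-codebook sizes and the per-subframe bit counts is simple arithmetic: a single index spanning all sub-codebooks needs the base-2 logarithm of the total entry count. The sketch below illustrates this; the dictionary and function names are hypothetical, and treating the three sub-codebooks as sharing one index is an assumption made only to show how the stated totals arise.

```python
import math

# Sub-codebook sizes (entries per subframe) as stated in the text.
MODE0_SUBCODEBOOKS = {"2-pulse": 2**14, "3-pulse": 2**13, "Gaussian": 2**13}
MODE1_SUBCODEBOOKS = {"2-pulse": 2**12, "3-pulse": 2**11, "6-pulse": 2**11}


def index_bits(subcodebooks):
    """Bits needed for one index that spans all sub-codebooks (assumed layout)."""
    total_entries = sum(subcodebooks.values())
    return math.ceil(math.log2(total_entries))


print(index_bits(MODE0_SUBCODEBOOKS))  # Mode 0 bits per subframe
print(index_bits(MODE1_SUBCODEBOOKS))  # Mode 1 bits per subframe
```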

The parameters of the system are represented by 80 bits per frame resulting in a
bit-rate of 4 kbits/s. An overview of the bit-allocation is shown in Table 1.

Table 1: Detailed bit-allocation (bits per 20 ms).

Parameter                 Mode 0 (2 subframes)              Mode 1 (3 subframes)

LSFs                      Predictor switch: 1 bit; 1st stage: 7 bits; 2nd stage: 7 bits;
                          3rd stage: 6 bits; total 21 bits (both modes)

Mode                      1 bit (both modes)

Adaptive codebook         7 bits/subframe; 14 bits          7 bits/frame; 7 bits

Fixed codebook            2-pulse codebook: 16384/subframe  2-pulse codebook: 4096/subframe
                          3-pulse codebook: 8192/subframe   3-pulse codebook: 2048/subframe
                          Gaussian codebook: 8192/subframe  6-pulse codebook: 2048/subframe
                          Total: 32768/subframe             Total: 8192/subframe
                          15 bits/subframe; 30 bits         13 bits/subframe; 39 bits

Adaptive codebook gain    2D VQ/subframe (joint with fixed  3D preVQ/frame; 4 bits
                          codebook gain); 7 bits/subframe;
                          14 bits

Fixed codebook gain       (coded jointly, see above)        3D delayed VQ/frame; 8 bits

TOTAL                     80 bits                           80 bits


The 80 bits per frame of Table 1 are transmitted from an encoder to a decoder.
Preferably, the decoder maps the 80 bits back to the parameters of the encoder. A
synthesis of a speech signal from these parameters is similar to the ITU-Recommendation
G.729 main body. The post-filter has a long-term (pitch) and a short-term (LPC)
post-processing.
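
As a consistency check on Table 1, the per-mode allocations can be summed and converted to a bit rate. This is a minimal sketch; the field names are illustrative labels for the rows of Table 1 and are not identifiers from the specification.

```python
FRAME_MS = 20  # frame duration stated in the text

# Bits per frame for each parameter group, following Table 1.
BIT_ALLOCATION = {
    0: {
        "lsf": 21,
        "mode": 1,
        "adaptive_codebook": 2 * 7,   # 7 bits per subframe, 2 subframes
        "fixed_codebook": 2 * 15,     # 15 bits per subframe
        "joint_gain_vq": 2 * 7,       # 2D VQ, 7 bits per subframe
    },
    1: {
        "lsf": 21,
        "mode": 1,
        "adaptive_codebook": 7,       # pitch lag coded once per frame
        "fixed_codebook": 3 * 13,     # 13 bits per subframe, 3 subframes
        "adaptive_codebook_gain": 4,  # 3D pre-VQ, once per frame
        "fixed_codebook_gain": 8,     # 3D delayed VQ, once per frame
    },
}


def bits_per_frame(mode):
    """Total number of bits transmitted per frame for the given mode."""
    return sum(BIT_ALLOCATION[mode].values())


def bit_rate_bps(mode):
    """Bit rate in bits per second for the given mode."""
    return bits_per_frame(mode) * 1000 // FRAME_MS


for mode in (0, 1):
    print(mode, bits_per_frame(mode), bit_rate_bps(mode))
```

Both modes come to the same total, so the mode bit is the only information the decoder needs to parse a frame correctly.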

1. Encoder System. 

Figures 1 and 3 illustrate the frame based processing stages that are used in Mode
0 and Mode 1. The pre-processing stages that condition the speech signal prior to
encoding are shown in Fig. 1 and the common frame based encoding is shown in Fig. 3.
The processing functions dedicated to Mode 0 and Mode 1 are shown in
Figures 4 and 6, respectively.

Fig. 1 shows the pre-processing of a speech signal prior to the actual speech
encoding. The pre-processing circuit includes a silence enhancement circuit or function
110, a high-pass filter 120, and a background noise attenuation circuit or function 130.
After an input signal 100 is received, a silence enhancement 110 function occurs. The
enhanced signal is then filtered by a high pass filter (HPF) 120 and conditioned by a
noise attenuation circuit 130 that generates a pre-processed speech signal 195.

A. Silence Enhancement Function.

After reading and buffering speech samples for a given frame, a speech segment
is analyzed to detect the presence of pure silence, i.e., "silence noise." This function
adaptively tracks a minimum resolution and the levels of the signal near zero. According
to this analysis, the function adaptively detects on a frame-by-frame basis whether the
current frame is silence and only contains "silence noise." If a "silence noise" is
detected, the silence enhancement 110 ramps the input signal to the zero-level of the
speech input signal. The zero-level of the input speech signal 105 depends on the prior
processing of the speech coding method. For A-law, the zero-level is 8, while for μ-law
and 16 bit linear PCM (Pulse Code Modulation), the zero-level is 0. Preferably, the
zero-level of the signal is tracked adaptively by the silence enhancement 110. It should
be noted that the silence enhancement 110 may only modify an input speech signal 105 if
the sample values for the given frame are within two quantization levels of the zero-level.
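
The gating condition and the ramp can be sketched as follows. This is a hedged illustration only: the text does not specify the ramp shape or how the minimum resolution is tracked, so the linear ramp, the `step` parameter (standing in for the tracked quantization step), and the function name are all assumptions.

```python
def silence_enhance(frame, zero_level=0, step=1):
    """Ramp a frame toward zero_level when it contains only "silence noise".

    The frame is modified only if every sample lies within two quantization
    levels (2 * step) of zero_level; otherwise it is returned unchanged.
    zero_level and step are taken as given here, whereas the text describes
    tracking them adaptively.
    """
    if any(abs(x - zero_level) > 2 * step for x in frame):
        return list(frame)  # not pure silence: leave untouched

    n = len(frame)
    out = []
    for i, x in enumerate(frame):
        # Linear ramp (an assumption): the weight on the original sample
        # falls from nearly 1 at the start of the frame to 0 at the end.
        weight = 1.0 - (i + 1) / n
        out.append(zero_level + weight * (x - zero_level))
    return out


print(silence_enhance([1, -1, 2, 0]))     # ramped toward the zero-level
print(silence_enhance([100, 0, -3, 1]))   # returned unchanged
```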

The silence enhancement 110 cleans up the silence portions of clean speech for
very low-level noise, and enhances the perceptual quality of that speech. The effect of
the enhancement 110 becomes especially noticeable when the input originates from an
A-law source, i.e., the input has passed through an A-law encoding and decoding process
immediately prior to being processed by the speech coding system. The noticeable
difference in the signal is due to the amplification of sample values around zero (e.g., -1,
0, +1) to either -8 or +8 that is inherent in A-law. The amplification has the potential
of transforming an inaudible "silence noise" into a clearly audible noise.

B. High-Pass Filter.

The input high-pass filter 120 is similar to the input high-pass filter of G.729.
It is a 2nd order filter having a cut-off frequency of approximately 140 Hz. The high-pass
filter can be expressed as:

H(z) = \frac{0.92727435 - 1.8544941 z^{-1} + 0.92727435 z^{-2}}{1 - 1.9059465 z^{-1} + 0.9114024 z^{-2}}   (Equation 1)

Preferably, the input is scaled down by a factor of 2 during high-pass filtering. This may
be achieved by dividing the coefficients of the numerator by a factor of 2.
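
A minimal sketch of this filter as a second-order recursive section is shown below, using the coefficients of Equation 1 with the numerator halved. The direct-form I structure, zero initial state, and function name are assumptions; the specification does not state how the filter is realized.

```python
# Coefficients of Equation 1; the numerator is halved to scale the input by 1/2.
B = [0.92727435 / 2, -1.8544941 / 2, 0.92727435 / 2]
A = [1.0, -1.9059465, 0.9114024]


def high_pass(samples):
    """Filter a sequence with a direct-form I biquad; state starts at zero."""
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = B[0] * x + B[1] * x1 + B[2] * x2 - A[1] * y1 - A[2] * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out


# A constant (DC) input is almost completely removed once the transient decays,
# while an alternating input at half the sampling rate passes at about half
# amplitude because of the factor-of-2 scaling.
dc_out = high_pass([1000.0] * 4000)
nyquist_out = high_pass([(-1.0) ** n for n in range(4000)])
print(dc_out[-1], abs(nyquist_out[-1]))
```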

C. Noise Attenuation. 

Noise attenuation 130 having a maximum attenuation of about 5 dB is performed
to improve the estimation of the parameters in the system while leaving the listener with
a clear sensation of the listener's environment. In Fig. 2, a speech segment in 15 dB
additive vehicle noise is shown with an output from G.729 and a 4 kbits/s eX-CELP. As
shown, the noise attenuation 130 of Fig. 1 incorporated in the 4 kbits/s eX-CELP system
results in an input-to-output attenuation slightly higher than the inherent attenuation of
noise produced by G.729. More precisely, the ITU-Recommendation G.729 output
speech signal 215 and the 4 kbits/s output speech signal 295 each illustrate the
attenuation of the noise level in the input speech signal 205 having the 15 dB vehicle
noise.

2. Common Frame Based Processing.

Fig. 3 is a block diagram illustrating a preferred common frame based process 300
that is performed on a pre-processed speech signal 195 prior to performing a Mode
dependent processing. A pre-processed speech signal is received by a perceptual
weighting filter block 350, a linear prediction coefficient (LPC) analysis block 310, and a
voice activity detection (VAD) block 340. After passing through the perceptual
weighting filter block 350, weighted speech is passed to a pitch processing block 380 and
an open loop pitch estimation block 360. The pitch processing block 380 comprises a
waveform interpolation block 382 and a pitch pre-processing block 384. A modified
weighted speech signal is passed from the pitch processing block 380 to a Mode
dependent processing block 395.

A linear prediction coefficient (LPC) analysis block 310 processes the
pre-processed speech 195 and generates an output received by the voice activity
detection (VAD) block 340 and a line spectral frequency (LSF) smoothing block 320.
Similarly, the voice activity detection (VAD) block 340 also processes the pre-processed
speech 195 and generates an output received by the line spectral frequency (LSF)
smoothing block 320. The line spectral frequency (LSF) smoothing block 320 processes
the output from the linear prediction coefficient (LPC) analysis block 310 and the voice
activity detection (VAD) block 340 and generates an output received by a line spectral
frequency (LSF) quantization block 330. The line spectral frequency (LSF) quantization
block 330 generates an output, A_q(z), received by the mode dependent processing
block 395.

The voice activity detection (VAD) block 340 also provides an output to a
classification block 370 that generates control information received by the mode
dependent processing block 395 and a mode selection block 390. The weighted speech
generated by the perceptual weighting filter block 350 is received by the classification
block 370 and the pitch processing block 380 after being processed by the open loop
pitch estimation block 360. The pitch processing block 380 and the classification block
370 are also communicatively coupled. The pitch processing block 380 and the
classification block 370 generate output received by a mode selection block 390. The
pitch processing block 380 provides pitch track information and unquantized pitch gains
to the mode dependent processing block 395.

A. LPC Analysis.

Preferably, in each frame three 10th order LPC analyses are performed. The LPC
analyses are centered at a middle third, a last third, and a lookahead of a frame. The LPC
analysis for the lookahead frame is recycled in the next frame as the LPC analysis
centered at the first third of that frame. Consequently, four sets of LPC parameters are
available at the encoder in each frame.

A symmetric Hamming window is used for the LPC analyses of the middle and
last third of the frame, and an asymmetric Hamming window is used for the LPC analysis
of the lookahead segment to center the weight appropriately. For each of the windowed
segments, the 10th order autocorrelation coefficients, r(k), may be calculated according to
Equation 2,

r(k) = \sum_{n=k}^{N-1} s_w(n) \cdot s_w(n-k), \quad k = 0, 1, \ldots, 10   (Equation 2)

where s_w(n) is the speech signal after weighting with the proper Hamming window and
N is the window length in samples. A bandwidth expansion of 60 Hz and a white noise
correction factor of 1.0001, i.e., adding a noise floor of -40 dB, are applied by weighting
the autocorrelation coefficients according to Equation 3,

r_w(k) = w(k) \cdot r(k)   (Equation 3)

where the weighting function is expressed by Equation 4

w(k) = \begin{cases} 1.0001, & k = 0 \\ \exp\left[-\frac{1}{2}\left(\frac{2\pi \cdot 60 \cdot k}{8000}\right)^{2}\right], & k = 1, 2, \ldots, 10 \end{cases}   (Equation 4)

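
The autocorrelation and lag-window weighting of Equations 2 to 4 can be sketched as below. The summation limits follow the conventional autocorrelation of a finite windowed segment, which is an assumption since the printed Equation 2 is not fully legible; the function names are likewise illustrative.

```python
import math

ORDER = 10
SAMPLE_RATE_HZ = 8000.0


def autocorrelation(s_w, order=ORDER):
    """r(k) for k = 0..order of an already windowed segment s_w."""
    n = len(s_w)
    return [sum(s_w[i] * s_w[i - k] for i in range(k, n))
            for k in range(order + 1)]


def lag_window(order=ORDER, bandwidth_hz=60.0, white_noise=1.0001):
    """Weights w(k): white noise correction at k = 0, Gaussian lag window after."""
    weights = [white_noise]
    for k in range(1, order + 1):
        x = 2.0 * math.pi * bandwidth_hz * k / SAMPLE_RATE_HZ
        weights.append(math.exp(-0.5 * x * x))
    return weights


def weighted_autocorrelation(s_w):
    """r_w(k) = w(k) * r(k) for k = 0..ORDER."""
    return [w * r for w, r in zip(lag_window(), autocorrelation(s_w))]


print(lag_window()[:3])
```

Raising r(0) by the factor 1.0001 acts like adding a small amount of white noise to the signal, and the decaying weights for k >= 1 widen the formant bandwidths slightly; both steps make the subsequent LP analysis better conditioned.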

Based on the weighted autocorrelation coefficients, the short-term LP filter coefficients,
i.e.,

A(z) = 1 - \sum_{i=1}^{10} a_i \cdot z^{-i},   (Equation 5)

are estimated using the Leroux-Gueguen method, and the LSF (Line Spectrum
Frequency) parameters are derived from the polynomial A(z). Three sets of LSF
parameters can be represented as expressed in Equation 6,

lsf_j(k), \quad k = 1, 2, \ldots, 10   (Equation 6)

where lsf_2(k), lsf_3(k), and lsf_4(k) are the LSFs for the middle third, last third, and
lookahead of the frame, respectively.

If the signal has extremely low energy, such as zero energy based on an integer
truncated signal, a flat LPC spectrum is generated. This result prevents certain low level
problems caused by interaction between the LPC filter and the gain quantization. It has
been found that in some cases of very low level energy segments, such as practically zero
energy, the LPC filters can have high gains. In this condition, the predictive gain
quantizer for a fixed codebook gain generally is unable to reduce the energy level to a
target level, and consequently, audible artifacts are generated. This condition is avoided
by the described system. When this condition is not encountered (in case of non-zero
signal), the reflection coefficients and prediction coefficients are derived and converted to
the LSFs.
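
The following sketch shows how prediction coefficients can be obtained from the weighted autocorrelation with a flat-spectrum fallback for zero energy. Note the substitution: the specification names the Leroux-Gueguen method, whereas this example uses the closely related Levinson-Durbin recursion, which solves the same normal equations and is easier to show in floating point. The function name and the exact zero-energy test are assumptions.

```python
def lp_coefficients(r_w, order=10):
    """Return a_1..a_order for A(z) = 1 - sum_i a_i z^-i (Levinson-Durbin).

    r_w holds the weighted autocorrelation r_w(0)..r_w(order). If the segment
    has no energy, all coefficients are zero, so A(z) = 1 (flat spectrum).
    """
    if r_w[0] <= 0.0:
        return [0.0] * order  # flat LPC spectrum for zero-energy input

    a = [0.0] * (order + 1)   # a[0] is unused; a[i] multiplies z^-i
    err = r_w[0]              # prediction error energy
    for i in range(1, order + 1):
        acc = r_w[i] - sum(a[j] * r_w[i - j] for j in range(1, i))
        k = acc / err         # i-th reflection coefficient
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= 1.0 - k * k
        if err <= 0.0:
            break             # numerically degenerate: stop early
    return a[1:]


# First-order autoregressive example: r(k) = 0.5**k yields a_1 = 0.5 and
# zeros for all higher-order coefficients.
print(lp_coefficients([0.5 ** k for k in range(11)]))
```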

B. LSF Smoothing.

Before LSF quantization, the LSFs are smoothed in time to reduce unwanted
fluctuations in the spectral envelope of the LPC synthesis filter. Smoothing is done
during "smooth" background noise to preserve the perceptual characteristic of the
background noise. The smoothing is controlled by the VAD information and analysis of
the evolution of the spectral envelope. The LSF smoothing factor is denoted \beta_{lsf} and is
applied according to the following parameters.

1. At the beginning of "smooth" background noise segments the smoothing
factor is preferably ramped quadratically from 0.0 to 0.9 over 5 frames.

2. During "smooth" background noise segments the smoothing factor is
preferably 0.9.

3. At the end of "smooth" background noise segments the smoothing factor is
preferably reduced to 0.0 instantaneously.

4. During non-"smooth" background noise segments the smoothing factor is
preferably 0.0.

According to the LSF smoothing factor, the LSFs for the quantization can be calculated
as follows:

lsf_n(k) = \beta_{lsf} \cdot lsf_{n-1}(k) + (1 - \beta_{lsf}) \cdot lsf_3(k), \quad k = 1, 2, \ldots, 10   (Equation 7)

where lsf_n(k) and lsf_{n-1}(k) represent the smoothed LSFs of the current and previous
frame, respectively, and lsf_3(k) represents the LSFs of the LPC analysis centered at the
last third of the current frame.
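
The smoothing rules and Equation 7 can be sketched as follows. The text states only that the factor is ramped quadratically from 0.0 to 0.9 over 5 frames; the specific ramp formula 0.9 * (i / 5)^2 and the function names are assumptions made for illustration.

```python
def smoothing_factor(frames_into_noise, ramp_frames=5, target=0.9):
    """Smoothing factor for the i-th frame (1-based) of a "smooth" noise segment.

    A value of 0 or less means the frame is outside such a segment, which
    gives a factor of 0.0. The quadratic ramp shape is an assumed form.
    """
    if frames_into_noise <= 0:
        return 0.0
    if frames_into_noise >= ramp_frames:
        return target
    return target * (frames_into_noise / ramp_frames) ** 2


def smooth_lsf(prev_smoothed, lsf3, beta):
    """Equation 7: blend the previous smoothed LSFs with the current lsf_3."""
    return [beta * p + (1.0 - beta) * c for p, c in zip(prev_smoothed, lsf3)]


# With beta = 0.0 the current LSFs pass through unchanged; larger values of
# beta hold the smoothed LSFs closer to those of the previous frame.
print([smoothing_factor(i) for i in range(0, 7)])
print(smooth_lsf([1.0, 2.0], [3.0, 4.0], 0.5))
```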

C. LSF Quantization. 

The 10th order LPC model given by the smoothed LSFs (Equation 7) is quantized
in the LSF domain once per frame using 21 bits. The detailed bit-allocation is shown in
Table 1. A three stage switched MA (Moving Average) predictive vector quantization
scheme is used to quantize the 10 dimensional LSF vector. The input LSF vector
(unquantized vector) originates from the LPC analysis centered at the last third of the
frame. The error criterion of the quantization is a WMSE (Weighted Mean Squared
Error) measure, where the weighting is a function of the LPC magnitude spectrum.

Accordingly, the objective of the quantization can be expressed as Equation 8,

$\{\hat{lsf}_n(1), \hat{lsf}_n(2), \ldots, \hat{lsf}_n(10)\} = \arg\min \left\{ \sum_{k=1}^{10} w_k \cdot (lsf_n(k) - \hat{lsf}_n(k))^2 \right\}$ (Equation 8)

where the weighting is

$w_k = |P(lsf_n(k))|^{0.4}$ (Equation 9)

and where $|P(f)|$ is the LPC power spectrum at frequency $f$, and the index $n$ denotes the frame
number. The quantized LSFs $\hat{lsf}_n(k)$ of the current frame are based on a 4th order MA
prediction and are given by Equation 10,

$\hat{lsf}_n = \tilde{lsf}_n + \hat{\Delta}_n^{lsf}$ (Equation 10)

where $\tilde{lsf}_n$ is the predicted LSFs of the current frame (a function of
$\hat{\Delta}_{n-1}^{lsf}, \hat{\Delta}_{n-2}^{lsf}, \hat{\Delta}_{n-3}^{lsf}, \hat{\Delta}_{n-4}^{lsf}$),
and $\hat{\Delta}_n^{lsf}$ is the quantized prediction error at the current frame. The prediction error is
given by Equation 11.

$\Delta_n^{lsf} = lsf_n - \tilde{lsf}_n$ (Equation 11)

The prediction error from the 4th order MA prediction is quantized with three 10
dimensional codebooks of sizes 7 bits, 7 bits, and 6 bits, respectively. The remaining bit
is used to specify either of two sets of predictor coefficients, where the weaker predictor
improves (reduces) error propagation during channel errors. The prediction matrix is
fully populated, i.e., prediction in both the time and the frequency is applied. A closed
loop delayed decision is used to select the predictor and the final entry from each stage
based on a subset of candidates. The number of candidates from each stage is 10,
resulting in the future consideration of 10, 10, and 1 candidates after the 1st, 2nd, and 3rd
codebook, respectively.
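The delayed decision across stages can be sketched as a multi-stage search that keeps several survivors per stage. This is only an illustration under assumptions: the function names are hypothetical, the codebooks in the usage are toy values, and the MA prediction and the predictor switch bit are left out.

```python
def wmse(target, candidate, weights):
    """Weighted squared error in the style of Equation 8."""
    return sum(w * (t - c) ** 2 for t, c, w in zip(target, candidate, weights))


def multistage_search(target, codebooks, weights, n_keep=10):
    """Search the stages in order, keeping n_keep survivors after every
    stage except the last, where only the best candidate is kept.

    Returns (error, reconstruction, list of chosen indices).
    """
    zero = [0.0] * len(target)
    survivors = [(wmse(target, zero, weights), zero, [])]
    for stage, codebook in enumerate(codebooks):
        last = stage == len(codebooks) - 1
        expanded = []
        for _, recon, path in survivors:
            for index, codevector in enumerate(codebook):
                trial = [r + c for r, c in zip(recon, codevector)]
                expanded.append((wmse(target, trial, weights), trial, path + [index]))
        expanded.sort(key=lambda item: item[0])
        survivors = expanded[:1] if last else expanded[:n_keep]
    return survivors[0]
```

Keeping more than one survivor matters when the best first-stage entry is not part of the best overall combination; with `n_keep=1` the search degenerates into a greedy stage-by-stage choice.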

After reconstruction of the quantized LSF vector according to Equation 10, the 
ordering property is checked. If two or more pairs are flipped the LSF vector is declared 
erased and is reconstructed preferably using a frame erasure concealment of the decoder. 
This check facilitates the addition of an error check at the decoder based on the LSF
ordering while maintaining bit-exactness between the encoder and the decoder during
error free conditions. An encoder-decoder synchronized LSF erasure concealment
improves performance during error conditions while not degrading performance in error
free conditions. Although theoretically this condition may occur during speech, it was
found to rarely occur. If only one pair is flipped, they are re-ordered in synchrony with
the decoder. Finally, a minimum spacing of 50 Hz between adjacent LSF coefficients is
enforced. 
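The ordering check can be sketched as follows. The function name is hypothetical, and the way the 50 Hz spacing is enforced (pushing the upper coefficient upward) is an assumption, because the text does not say how the spacing is restored.

```python
def check_lsf_ordering(lsf, min_gap_hz=50.0):
    """Return the corrected LSF vector, or None if it is declared erased."""
    flipped = [i for i in range(len(lsf) - 1) if lsf[i] > lsf[i + 1]]
    if len(flipped) >= 2:
        return None                      # erased: concealment takes over
    out = list(lsf)
    if flipped:
        i = flipped[0]
        out[i], out[i + 1] = out[i + 1], out[i]   # re-order the single pair
    for i in range(1, len(out)):
        if out[i] - out[i - 1] < min_gap_hz:
            out[i] = out[i - 1] + min_gap_hz      # assumed spacing rule
    return out
```

Because the decision depends only on the quantized vector, a decoder running the same routine reaches the same result in error free conditions.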



D. VAD (Voice Activity Detection).

A voice activity detection system is embedded in the encoder to provide
information on the characteristic of the input signal. The VAD information is used to
control several aspects of the encoder including estimation of Signal to (background)
Noise Ratio (SNR), pitch estimation, classification, spectral smoothing, energy
smoothing, and gain normalization. The voice activity detection system is based on the
absolute maximum of a frame, reflection coefficients, prediction error, an LSF vector, the
10th order autocorrelation, recent pitch lags, and recent pitch gains. The LPC related
parameters originate from the LPC analysis centered at the last third of the frame. The
pitch related parameters are delayed by one frame since pitch lags and gains of the
current frame are not yet available.

E. Perceptual Weighting Filter.



The perceptual weighting filter is comprised of two filters. The first filter is
derived from the unquantized LPC filter given by:

$W_1(z) = \frac{A(z/\gamma_1)}{A(z/\gamma_2)}$ (Equation 12)



where $\gamma_1 = 0.9$ and $\gamma_2 = 0.55$. The second filter is an adaptive low-pass filter given by:

$W_2(z) = \frac{1}{1 - \eta z^{-1}}$ (Equation 13)

where $\eta$ is a function of the tilt of the spectrum, i.e., the first reflection coefficient of the
LPC analysis. The second filter, which is a weighting filter, is used only for the open loop
pitch estimation, waveform interpolation and pitch pre-processing. For the adaptive and
fixed codebook searches, gain quantization, etc., only the first filter (i.e., first weighting
filter) is applied.

F. Open Loop Pitch Estimation. 

For every frame, the open loop pitch lag has to be estimated for the first half and
the second half of the frame. Mode 0 uses the two open loop pitch lags for the
search of the adaptive codebook for the first and second subframe, respectively. Mode 1
uses the open loop pitch lag for the second half of the frame as basis for the interpolated
pitch track for the pitch pre-processing. The open loop pitch lag for the first half of the
frame is not used for Mode 1.

The open loop pitch estimation is based on the weighted speech given by
Equation 14,

$S_w(z) = S(z) \cdot W_1(z) W_2(z)$ (Equation 14)

where $S(z)$ is the pre-processed speech signal. The pitch lag preferably ranges from 17 to
127 samples.

Two open loop pitch lags and pitch correlation coefficients are estimated per
frame. The first set is centered at the second half of the frame, and the second set is
centered at the lookahead of the frame. The set centered at the lookahead is reused
during the next frame as the set centered at the first half of the frame. Consequently at
every frame, three sets of pitch lag and pitch correlation coefficient are available at the
encoder at the computational expense of two sets.

Each of the two sets is calculated according to the following steps. First, the
normalized correlation function is calculated as given by:

$R(k) = \frac{\sum_{n=0}^{L} s_w(n) \cdot s_w(n-k)}{E}$ (Equation 15)

where $L = 80$ is the window size, and $E$, which is the energy of the segment, is expressed
as:

$E = \sum_{n=0}^{L} s_w(n)^2$ (Equation 16)

The maximum of the normalized correlation $R(k)$ in each of three regions [17,33],
[34,67], and [68,127] is then determined. This results in three candidates for the pitch
lag. An initial best candidate from the three candidates is selected based on the
normalized correlation, classification information, and the history of the pitch lag. Once
the initial best lag for the second half of the frame and the lookahead is available, the
initial estimates for the lag at the first half, the second half of the frame, and the
lookahead are ready. A final adjustment of the estimates of the lag for the first and
second half of the frame is calculated based on the context of the respective lags with
regards to the overall pitch contour, e.g., for the pitch lag for the second half of the frame,
information on the pitch lag in the past and the future (the lookahead) is available.
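The first step, finding one candidate lag per region, can be sketched as below. The function name and the `start` argument (index of the first sample of the analysis window inside a longer buffer of weighted speech) are hypothetical. The sums run over n = 0 to L inclusive, as Equations 15 and 16 are written, and the later selection among the three candidates is not modeled.

```python
def region_candidates(sw, start, window=80,
                      regions=((17, 33), (34, 67), (68, 127))):
    """Return one (lag, normalized correlation) pair per lag region.

    sw must hold at least 127 samples before index start.
    """
    energy = sum(sw[start + n] ** 2 for n in range(window + 1))
    if energy == 0.0:
        return [(lo, 0.0) for lo, _ in regions]

    def corr(lag):
        num = sum(sw[start + n] * sw[start + n - lag] for n in range(window + 1))
        return num / energy

    return [max(((lag, corr(lag)) for lag in range(lo, hi + 1)),
                key=lambda pair: pair[1])
            for lo, hi in regions]
```

For a strictly periodic input, pitch multiples score as well as the true lag in the upper region, which is one reason the final choice also uses classification information and the pitch history.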

G. Classification.

The eX-CELP method makes use of classification in many modules to emphasize the
perceptually important features during encoding. The three main frame based
classifications are detection of unvoiced noise-like speech, a six grade signal
characteristic classification, and a six grade classification to control the pitch
pre-processing.

3. Detection of Unvoiced Noise-Like Speech.

The detection of unvoiced noise-like speech is used for several purposes. The
main purposes are generating the noise-like speech class in the Signal Characteristic
Classification, and controlling the pitch pre-processing. The detection is based on the
weighted residual signal given by Equation 17 and the pre-processed speech signal.

$R_w(z) = A(z/\gamma_1) \cdot S(z)$ (Equation 17)

From the input signals, the residual sharpness, first reflection coefficient, zero crossing 
rate, and the prediction factor are calculated and used by the decision logic. Residual 
sharpness can be expressed as Equation 18,

$\phi = \frac{\frac{1}{L} \sum_{n=0}^{L-1} |r_w(n)|}{\max\{|r_w(n)|, \; n = 0, 1, \ldots, L-1\}}$, (Equation 18)

where $r_w(n)$ is the weighted residual signal and $L = 160$ is the frame size. First reflection
coefficient (tilt of the magnitude spectrum) of the pre-processed speech signal can be
expressed as Equation 19,

$\varphi = \frac{\sum_{n=1}^{L-1} s(n) \cdot s(n-1)}{\sum_{n=0}^{L-1} s(n)^2}$, (Equation 19)



where $s(n)$ is the pre-processed speech signal and $L = 160$ is the frame size. Zero crossing
rate of the pre-processed speech signal can be expressed as Equation 20,

$\gamma = \frac{1}{L} \sum_{n=0}^{L-1} \{s(n) \cdot s(n-1) < 0 \;?\; 1 : 0\}$, (Equation 20)

and the prediction factor can be expressed as Equation 21,



$\eta = 1 - \sqrt{\frac{\sum_{n=0}^{L-1} r_w(n)^2}{\sum_{n=0}^{L-1} s(n)^2}}$ (Equation 21)

The detection of noise-like unvoiced speech is performed in the 4 dimensional space
spanned by $(\phi, \varphi, \gamma, \eta)$ by comparison to fixed decision boundaries.

4. Signal Characteristic Classification.

The eX-CELP system classifies frames into one of six classes according to the
dominant features of that frame. The frame may be classified according to:

0. Silence/Background Noise;

1. Noise-Like Unvoiced Speech;

2. Unvoiced;

3. Onset;

4. Plosive (which is not used);

5. Non-Stationary Voiced; and

6. Stationary Voiced.

Currently, class 4 is not used. To more effectively use information available in the
encoder, the central module for the classification does not initially distinguish classes 5
and 6. This distinction is instead done during the pitch pre-processing where additional
information is available. Furthermore, the central module does not initially detect class 1.
This class is also introduced during the pitch pre-processing based on additional
information and the detection of noise-like unvoiced speech. Hence, the central module
distinguishes between silence/background noise, unvoiced speech, onset, and voiced
speech using the class number 0, 2, 3, and 5, respectively.

The central signal classification module receives the pre-processed speech signal,
the pitch lag and correlation of the second half of the frame, and the VAD information.
Based on these parameters, the module initially derives the spectral tilt, the reflection
coefficient, and the pitch correlation. The spectral tilt (estimation of first reflection
coefficient 4 times per frame) can be calculated by Equation 22,

$\kappa(k) = \frac{\sum_{n=1}^{L-1} s_k(n) \cdot s_k(n-1)}{\sum_{n=0}^{L-1} s_k(n)^2}, \quad k = 0, 1, \ldots, 3$ (Equation 22)

where $L = 80$ is the window over which the reflection coefficient is calculated and $s_k(n)$ is
the $k$th segment calculated by Equation 23.

$s_k(n) = s(k \cdot 40 - 20 + n) \cdot w_h(n), \quad n = 0, 1, \ldots, 79$ (Equation 23)

In Equation 23, $w_h(n)$ is an 80 sample Hamming window and $s(0), s(1), \ldots, s(159)$ is the
current frame of the pre-processed speech signal. The absolute maximum (tracking of
absolute signal maximum 8 estimates per frame) can be calculated by Equation 24,

$\chi(k) = \max\{|s(n)|, \; n = n_s(k), n_s(k)+1, \ldots, n_e(k)-1\}, \quad k = 0, 1, \ldots, 7$ (Equation 24)

where $n_s(k)$ and $n_e(k)$ are the starting point and end point, respectively, for the search of
the $k$th maximum at time $k \cdot 160/8$ samples of the frame. Preferably, the segments overlap
and the length of the segment is approximately one and one-half (1.5) times the pitch
period. At this point, a smooth contour of the amplitude envelope is obtained. Thus, the
spectral tilt, the absolute maximum, and the pitch correlation form the basis for the
classification. However, significant processing and analysis of the spectral tilt, the
absolute maximum, and the pitch correlation parameters are performed prior to the
decision.

The parameter processing initially applies weighting to the three parameters. The
weighting removes the background noise component in the parameters. This provides a
parameter space that is "independent" from any background noise and thus more uniform,
which improves the robustness of the classification to background noise.

Running means of the pitch period energy of the noise, the spectral tilt of the
noise, the absolute maximum of the noise, and the pitch correlation of the noise are
updated 8 times per frame according to Equations 25 through 28. These updates are
controlled by the VAD. The parameters defined by Equations 25 through 35 are
estimated 8 times per frame and provide a finer time resolution of the parameter space.
The running mean of the pitch period energy of the noise is calculated by Equation 25,

$\langle E_{N,p}(k) \rangle = \alpha_1 \cdot \langle E_{N,p}(k-1) \rangle + (1 - \alpha_1) \cdot E_p(k)$ (Equation 25)

where $E_{N,p}(k)$ is the normalized energy of the pitch period at time $k \cdot 160/8$ samples of the
frame. It should be noted that the segments over which the energy is calculated may
overlap since the pitch period typically exceeds 20 samples (160 samples/8).

The running mean of the spectral tilt of the noise is calculated by Equation 26.

$\langle \kappa_N(k) \rangle = \alpha_1 \cdot \langle \kappa_N(k-1) \rangle + (1 - \alpha_1) \cdot \kappa(k \bmod 2)$ (Equation 26)

The running mean of the absolute maximum of the noise is calculated by Equation 27.

$\langle \chi_N(k) \rangle = \alpha_1 \cdot \langle \chi_N(k-1) \rangle + (1 - \alpha_1) \cdot \chi(k)$ (Equation 27)

The running mean of the pitch correlation of the noise is calculated by Equation 28,

$\langle R_{N,p}(k) \rangle = \alpha_1 \cdot \langle R_{N,p}(k-1) \rangle + (1 - \alpha_1) \cdot R_p$ (Equation 28)

where $R_p$ is the input pitch correlation for the second half of the frame. The adaptation
constant $\alpha_1$ is adaptive, though the typical value is $\alpha_1 = 0.99$. The background noise to
signal ratio is calculated by Equation 29.



$\gamma(k) = \sqrt{\frac{\langle E_{N,p}(k) \rangle}{E_p(k)}}$ (Equation 29)

Preferably, the parametric noise attenuation is limited to 30 dB, i.e.,

$\gamma(k) = \{\gamma(k) > 0.968 \;?\; 0.968 : \gamma(k)\}$ (Equation 30)

The noise free set of parameters (weighted parameters) is obtained by removing the noise
component according to Equations 31 through 33. Estimation of weighted spectral tilt is
calculated by Equation 31.

$\kappa_w(k) = \kappa(k \bmod 2) - \gamma(k) \cdot \langle \kappa_N(k) \rangle$ (Equation 31)

Estimation of weighted absolute maximum is calculated by Equation 32.

$\chi_w(k) = \chi(k) - \gamma(k) \cdot \langle \chi_N(k) \rangle$ (Equation 32)

Estimation of weighted pitch correlation is calculated by Equation 33.

$R_{w,p}(k) = R_p - \gamma(k) \cdot \langle R_{N,p}(k) \rangle$ (Equation 33)
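The running means (Equations 25 through 28) and the noise removal (Equations 30 through 33) share one pattern, which the following sketch illustrates. The function names are hypothetical, and the noise-to-signal ratio is taken as an input rather than computed.

```python
ALPHA_1 = 0.99     # typical adaptation constant given in the text
GAMMA_MAX = 0.968  # cap applied by Equation 30


def update_noise_mean(prev_mean, value, alpha=ALPHA_1):
    """One running-mean update of the form used in Equations 25 to 28."""
    return alpha * prev_mean + (1.0 - alpha) * value


def clamp_gamma(gamma):
    """Equation 30: limit the noise to signal ratio."""
    return GAMMA_MAX if gamma > GAMMA_MAX else gamma


def remove_noise(value, gamma, noise_mean):
    """Weighted parameter in the form of Equations 31 to 33."""
    return value - clamp_gamma(gamma) * noise_mean
```

The cap leaves at least a fraction 1 - 0.968 = 0.032 of a pure-noise parameter in place, which corresponds to roughly 30 dB of attenuation in amplitude terms.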

The evolution or change of the weighted tilt and the weighted maximum is calculated
according to Equations 34 and 35, respectively, as the slope of the first order
approximation.

$\partial\kappa_w(k) = \frac{\sum_{l=1}^{7} l \cdot (\kappa_w(k-7+l) - \kappa_w(k-7))}{\sum_{l=1}^{7} l^2}$ (Equation 34)

$\partial\chi_w(k) = \frac{\sum_{l=1}^{7} l \cdot (\chi_w(k-7+l) - \chi_w(k-7))}{\sum_{l=1}^{7} l^2}$ (Equation 35)
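The slope in Equations 34 and 35 is a least-squares fit of a line through the oldest of eight samples. A short sketch, with a hypothetical function name and the history passed oldest first:

```python
def first_order_slope(history):
    """Slope of the first order approximation over 8 values.

    history holds x(k-7), x(k-6), ..., x(k), oldest first.
    """
    base = history[0]
    numerator = sum(l * (history[l] - base) for l in range(1, 8))
    denominator = sum(l * l for l in range(1, 8))  # equals 140
    return numerator / denominator
```

For an exactly linear history the function returns the per-step increment, which makes the value easy to interpret as change per parameter update.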



Once the parameters of Equations 25 through 35 are updated for the 8 sample points of the
frame, the following frame based parameters are calculated from the parameters defined
by Equations 25 through 35. The maximum weighted pitch correlation is calculated by
Equation 36.

$R_{w,p}^{max} = \max\{R_{w,p}(k-7+l), \; l = 0, 1, \ldots, 7\}$ (Equation 36)

The average weighted pitch correlation is calculated by Equation 37.




$R_{w,p}^{avg} = \frac{1}{8} \sum_{l=0}^{7} R_{w,p}(k-7+l)$ (Equation 37)

The running mean of average weighted pitch correlation is calculated by Equation 38,

$\langle R_{w,p}^{avg}(m) \rangle = \alpha_2 \cdot \langle R_{w,p}^{avg}(m-1) \rangle + (1 - \alpha_2) \cdot R_{w,p}^{avg}$, (Equation 38)

where $m$ is the frame number and $\alpha_2 = 0.75$ is the adaptation constant. Normalized
standard deviation of pitch lag is calculated by Equation 39,



$\sigma_{L_p}(m) = \frac{1}{\mu_{L_p}(m)} \sqrt{\frac{\sum_{l=0}^{2} (L_p(m-2+l) - \mu_{L_p}(m))^2}{3}}$ (Equation 39)

where $L_p(m)$ is the input pitch lag, and $\mu_{L_p}(m)$ is the mean of the pitch lag over the past
three frames that can be expressed by Equation 40.

$\mu_{L_p}(m) = \frac{1}{3} \sum_{l=0}^{2} L_p(m-2+l)$ (Equation 40)
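A small sketch of the pitch lag statistics follows. The function name is hypothetical. It assumes a population standard deviation (division by 3) normalized by the mean, which is one reading of "normalized standard deviation"; the source equation is not fully legible on this point.

```python
import math


def pitch_lag_stats(lags):
    """Return (mean, normalized standard deviation) of three pitch lags.

    lags holds the lags of frames m-2, m-1 and m, oldest first.
    """
    mean = sum(lags) / 3.0
    variance = sum((lag - mean) ** 2 for lag in lags) / 3.0
    return mean, math.sqrt(variance) / mean
```

A value near zero indicates a stable pitch track over the last three frames, which is consistent with the parameter being used as evidence for voiced speech.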

The minimum weighted spectral tilt is calculated by Equation 41.

$\kappa_w^{min} = \min\{\kappa_w(k-7+l), \; l = 0, 1, \ldots, 7\}$ (Equation 41)

The running mean of minimum weighted spectral tilt is calculated by Equation 42.

$\langle \kappa_w^{min}(m) \rangle = \alpha_2 \cdot \langle \kappa_w^{min}(m-1) \rangle + (1 - \alpha_2) \cdot \kappa_w^{min}$ (Equation 42)

The average weighted spectral tilt is calculated by Equation 43.

$\kappa_w^{avg} = \frac{1}{8} \sum_{l=0}^{7} \kappa_w(k-7+l)$ (Equation 43)

The minimum slope of weighted tilt is calculated by Equation 44.

$\partial\kappa_w^{min} = \min\{\partial\kappa_w(k-7+l), \; l = 0, 1, \ldots, 7\}$ (Equation 44)

The accumulated slope of weighted spectral tilt is calculated by Equation 45.

$\partial\kappa_w^{acc} = \sum_{l=0}^{7} \partial\kappa_w(k-7+l)$ (Equation 45)

The maximum slope of weighted maximum is calculated by Equation 46.

$\partial\chi_w^{max} = \max\{\partial\chi_w(k-7+l), \; l = 0, 1, \ldots, 7\}$ (Equation 46)

The accumulated slope of weighted maximum is calculated by Equation 47.

$\partial\chi_w^{acc} = \sum_{l=0}^{7} \partial\chi_w(k-7+l)$ (Equation 47)

The decision boundaries are complex, and the actual thresholds are operable to be
programmed. Preferably, the parameters given by Equations 44, 46, and 47 are used to
mark whether a frame is likely to contain an onset, and the parameters given by Equations
37, 38, 39, 41, 42 and 43 are used to mark whether a frame is likely to be dominated by
voiced speech. Based on the initial marks, the past marks, and the VAD information, the
frame is classified into one of four classes 0, 2, 3, or 5.

5. Classification to Control Pitch Pre-Processing. 

The pitch pre-processing is controlled with a classifier that distinguishes between
six categories. The categories are labeled numerically from -1 through 4. The
module is based on the VAD information, the unvoiced noise-like detection, the signal
characteristic classification, and the pitch correlation of the second half of the frame.
Class -1 is used to reset the pitch pre-processing to prevent an accumulated delay
introduced during pitch pre-processing that exceeds the delay budget. In this
embodiment, the remaining classes may indicate an increasing voicing strength and may
be based on the pitch correlation information.

A. Waveform Interpolation and Pitch Pre-Processing. 




The waveform interpolation and pitch pre-processing module has four functions.
First, the signal is modified to better match the estimated pitch track and more accurately
fit the coding model while being perceptually indistinguishable from the unmodified
signal. Second, certain irregular transition segments are modified to better fit the coding
model. The modification enhances the regularity and suppresses the irregularity using
forward-backward waveform interpolation. Again, the modification occurs without a loss
of perceptual quality. Third, the pitch gain and pitch correlation for the modified signal
are estimated. Finally, the signal characteristic classification is refined based on the
additional signal information that is obtained during the analysis for the waveform
interpolation and pitch pre-processing.

6. Pitch Pre-Processing. 

The pitch pre-processing occurs on a frame-by-frame basis. The analysis and
signal modification are based on the perceptually weighted speech rather than the LPC
residual signal. Preferably, the system performs continuous time warping as opposed to
simple integer sample shifting of the signal. The warping introduces a variable delay of a
maximum of approximately 20 samples (or about 2.5 ms) at the encoder. The delay is
limited to a maximum of approximately 20 samples so that the system does not exceed
the overall maximum delay according to the ITU-T terms of reference. The time-warped
signal is estimated using Hamming weighted Sinc interpolation filters. The signal is
preferably modified on a pitch cycle by pitch cycle basis. During the analysis certain
overlap between adjacent pitch cycles is incorporated to avoid discontinuities between
the reconstructed/modified segments. The signal is modified according to the input pitch
track, which is derived from the lags of the past and current frames.

The classification controls the pitch pre-processing in the following way. If the
frame is predominantly background noise or unvoiced speech with a low pitch correlation
(pitch pre-processing Class -1) the frame remains unchanged and the accumulated delay
of the pitch pre-processing is reset to zero. If the signal is predominantly pulse-like
unvoiced speech (pitch pre-processing Class 0), the accumulated delay is maintained
without any warping of the signal, and the output signal is a simple time shift (according
to the accumulated delay) of the input signal. For the remaining pitch pre-processing
classes the core of the pitch pre-processing method is executed to optimally warp the
signal.
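The control rule described above can be summarized in a small dispatch function. The function name and the action labels are hypothetical and only name the three behaviors described in the text.

```python
def preprocessing_action(pp_class, accumulated_delay):
    """Map a pitch pre-processing class (-1 through 4) to an action.

    Returns (action, accumulated_delay_after).
    """
    if pp_class not in range(-1, 5):
        raise ValueError("pitch pre-processing class must be in -1..4")
    if pp_class == -1:
        return "unchanged", 0.0                  # frame untouched, delay reset
    if pp_class == 0:
        return "shift_only", accumulated_delay   # time shift, no warping
    return "warp", accumulated_delay             # classes 1..4: full warping
```

For the warping classes the accumulated delay would subsequently be updated by the warping itself, which this sketch does not model.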

A. Estimate Segment Size. 

The segment size is preferably equal to the pitch period, though some adjustments
may be necessary. In general, the pitch complex (the main pulses) of the pitch cycle is
located towards the end of the segment to allow for maximum accuracy of the warping on
the perceptually most important part, the pitch complex. For a given segment the starting
point is fixed by the past and the end point is moved for a best fit, which stretches or
compresses the time scale. Consequently, the samples at the beginning of the segment
are shifted only slightly, while the end of the segment has the greatest shift.

B. Estimate Target Signal for Warping.

The target signal for the time-warping is a synthesis of the current segment
derived from the previous modified weighted speech $s'_w(n)$ and the input pitch track
$L_p(n)$. According to the pitch track $L_p(n)$, each sample value of the target signal
$s_w^t(n)$, $n = 0, \ldots, N_s - 1$, is obtained by interpolation of the previously modified weighted
speech using a 21st order Hamming weighted Sinc window as expressed by Equation 48.

$s_w^t(n) = \sum_{i=-10}^{10} w_s(f(L_p(n)), i) \cdot s'_w(n - i(L_p(n)))$ for $n = 0, \ldots, N_s - 1$ (Equation 48)

In Equation 48, $i(L_p(n))$ and $f(L_p(n))$ are the integer and fractional parts of the pitch lag,
respectively, $w_s(f, i)$ is the Hamming weighted Sinc window, and $N_s$ is the length of the
segment.

C. Estimate Warping Function.

The warping function is estimated to maximize the normalized correlation
between the weighted target signal and the warped weighted speech, i.e., by maximizing
Equation 49,

Zi S » W ' 0* + r acc)) 

d = n=o (Equation 49) 



«=o J v«=° 



where $s_w(n + \tau_{acc})$ is the weighted speech shifted according to the accumulated delay $\tau_{acc}$ of
the past pitch pre-processing, $f_{warp}(n)$ is the warping function, and $s_w^{wt}(n)$ is the weighted
target that can be expressed as Equation 50.

$s_w^{wt}(n) = w_e(n) \cdot s_w^t(n)$ (Equation 50)

The weighting function $w_e(n)$ is a two-piece linear function emphasizing the pitch
complex and de-emphasizing the "noise" that occurs between pitch complexes. The
weighting is adapted according to the pitch pre-processing classification increasing the
emphasis for segments of higher periodicity.

The warping function is estimated by initially estimating the integer shift that
maximizes the normalized cross correlation between the weighted target $s_w^{wt}(n)$ and the
input weighted speech $s_w(n + \tau_{acc})$ according to Equation 51,

$\tau_{shift} = \arg\max\{R_n(\tau_{shift}), \; \tau_{shift} = \tau_0, \ldots, \tau_1\}$ (Equation 51)

where

where 



$R_n(\tau_{shift}) = \frac{\sum_{n=0}^{N_s-1} s_w^{wt}(n) \cdot s_w(n + \tau_{acc} + \tau_{shift})}{\sqrt{\left(\sum_{n=0}^{N_s-1} s_w^{wt}(n)^2\right) \cdot \left(\sum_{n=0}^{N_s-1} s_w(n + \tau_{acc} + \tau_{shift})^2\right)}}$ (Equation 52)

and $\tau_0$ and $\tau_1$ specify the search range. The refined shift (including fractional shift) is
determined by searching an upsampled version of $R_n(\tau_{shift})$ in the vicinity of $\tau_{shift}$. This
search results in the calculation of the final optimal shift $\tau_{opt}$ and the corresponding
normalized cross correlation $R_n(\tau_{opt})$.
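The integer part of the shift search (Equation 51) can be sketched as an exhaustive search over a small range. The function names are hypothetical. The accumulated delay is assumed to be already rounded to an integer offset `base`, and the fractional refinement on the upsampled correlation is not shown.

```python
import math


def normalized_xcorr(target, speech, offset):
    """Normalized cross correlation between target and speech[offset:]."""
    segment = speech[offset:offset + len(target)]
    num = sum(t * s for t, s in zip(target, segment))
    den = math.sqrt(sum(t * t for t in target) * sum(s * s for s in segment))
    return num / den if den > 0.0 else 0.0


def best_integer_shift(target, speech, base, tau0, tau1):
    """Return the integer shift in [tau0, tau1] with the highest correlation.

    The caller must choose base, tau0 and tau1 so that every tested
    offset stays inside the speech buffer.
    """
    return max(range(tau0, tau1 + 1),
               key=lambda tau: normalized_xcorr(target, speech, base + tau))
```

In a fuller implementation the neighborhood of the returned shift would then be searched at fractional resolution, as the text describes.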

D. Estimate Warped Signal.

The modified weighted speech for the segment is reconstructed according to the
mappings expressed as:

$[s_w(n + \tau_{acc}), \; s_w(n + \tau_{acc} + \tau_c + \tau_{opt})] \rightarrow [s'_w(n), \; s'_w(n + \tau_c - 1)]$, (Equation 53)

and

$[s_w(n + \tau_{acc} + \tau_c + \tau_{opt}), \; s_w(n + \tau_{acc} + \tau_{opt} + N_s - 1)] \rightarrow [s'_w(n + \tau_c), \; s'_w(n + N_s - 1)]$, (Equation 54)

where $\tau_c$ is a parameter defining the warping function. The mappings specify the
beginning of the pitch complex. The mapping given by Equation 53 specifies a time
warping, and the mapping given by Equation 54 specifies a time shift (no warping).
Both are calculated by using a Hamming weighted Sinc window function.

7. Waveform Interpolation. 

The waveform interpolation is integrated with the pitch pre-processing. It is
performed on a pitch cycle by pitch cycle basis equivalently to the pitch pre-processing.
The waveform interpolation is performed following the estimation of the warped signal at
the pitch cycle level, i.e., reconstruction of the modified weighted speech. The main
objective of the waveform interpolation is to improve the onsets. Suppose that the
current segment contains the first main pitch complex (pulse) of the voiced segment.
This means that the correlation with the past will be low and pitch pre-processing will
have little benefit. In order to facilitate a rapid build-up of the onset in the following
segments, the current segment (pitch cycle) is modified as the weighted sum of the past
pitch cycle and the following pitch cycle if the benefit is significant. This will artificially
increase the pitch correlation for the next segment, and enhance the contribution from the
pitch pre-processing in the future. Consequently, this will increase the contribution from
the adaptive codebook during onsets resulting in a faster build-up.

A candidate segment (to replace the current segment) is estimated by predicting
the current pitch cycle from the past (forward prediction) and the future (backward
prediction). The forward prediction is already available as the target for the pitch
pre-processing, Equation 48, i.e.,

$v_{fw}(n) = s_w^t(n)$. (Equation 55)

The backward prediction $v_{bw}(n)$ is derived as the shift of the next pitch cycle of the
original weighted speech that results in the best match to the modified weighted speech of
the pitch processing, i.e.,

$\tau_{shift}^{bw} = \arg\max\{R_n^{bw}(\tau_{shift}^{bw}), \; \tau_{shift}^{bw} = \tau_0, \ldots, \tau_1\}$, (Equation 56)

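where

$R_n^{bw}(\tau_{shift}^{bw}) = \frac{\sum_{n=0}^{N_s-1} w_e(n) \cdot s'_w(n) \cdot s_w(n + L_p + \tau_{acc} + \tau_{opt} + \tau_{shift}^{bw})}{\sqrt{\left(\sum_{n=0}^{N_s-1} (w_e(n) \cdot s'_w(n))^2\right) \cdot \left(\sum_{n=0}^{N_s-1} s_w(n + L_p + \tau_{acc} + \tau_{opt} + \tau_{shift}^{bw})^2\right)}}$, (Equation 57)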



and $\tau_0$ and $\tau_1$ specify the search range. The weighting function $w_e(n)$ is similar to the
weighting during the pitch pre-processing. The refined shift (including fractional shift) is
determined by searching an upsampled version of $R_n^{bw}(\tau_{shift}^{bw})$ in the vicinity of $\tau_{shift}^{bw}$. This
results in the final optimal shift $\tau_{opt}^{bw}$ and the corresponding normalized cross correlation
$R_n^{bw}(\tau_{opt}^{bw})$. Based on the final optimal shift the backward prediction is derived by Equation
58,

i=-\0 

(Equation 58) 

where $i(\tau)$ and $f(\tau)$ are the integer and fractional parts of the argument $\tau$, respectively,
$w_s(f, i)$ is the Hamming weighted Sinc window, and $N_s$ is the length of the segment.

The forward and backward predictions are combined to form the predicted
segment according to Equation 59,

$v_p(n) = g_n \cdot (v_{fw}(n) + \beta \cdot v_{bw}(n))$, (Equation 59)

where $\beta$ is 1 if the backward prediction is successful ($R_n^{bw}(\tau_{opt}^{bw})$ above certain threshold)
and 0 if the backward prediction is unsuccessful. The gain factor $g_n$ normalizes the
energy of the predicted segment to the energy of the modified weighted speech from the
pitch pre-processing, i.e.,


$g_n = \sqrt{\frac{\sum_{n=0}^{N_s-1} s'_w(n)^2}{\sum_{n=0}^{N_s-1} (v_{fw}(n) + \beta \cdot v_{bw}(n))^2}}$. (Equation 60)
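The combination of the two predictions with energy normalization (Equations 59 and 60) can be sketched as follows. The function name and the boolean flag are hypothetical, and the gain is written as the square root of an energy ratio, which follows from the stated purpose of matching the energy of the modified weighted speech.

```python
import math


def predicted_segment(v_fw, v_bw, modified, backward_ok):
    """Combine forward and backward predictions and normalize the energy.

    modified is the output segment of the pitch pre-processing; its
    energy is the reference for the gain factor.
    """
    beta = 1.0 if backward_ok else 0.0
    mix = [f + beta * b for f, b in zip(v_fw, v_bw)]
    mix_energy = sum(x * x for x in mix)
    ref_energy = sum(x * x for x in modified)
    gain = math.sqrt(ref_energy / mix_energy) if mix_energy > 0.0 else 0.0
    return [gain * x for x in mix]
```

When the backward prediction is rejected, the result reduces to the forward prediction rescaled to the reference energy.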



The final candidate for the segment $v_c(n)$ is calculated as a weighted sum of the
predicted segment $v_p(n)$ and the output segment from the pitch pre-processing $s'_w(n)$
according to Equation 61,

$v_c(n) = \alpha(n) \cdot s'_w(n) + (1 - \alpha(n)) \cdot v_p(n)$ (Equation 61)

where the weighting provides a smooth transition from $v_c(n)$ to $s'_w(n)$ at the beginning of
the segment and at the end of the pitch cycle.

The candidate segment $v_c(n)$ only replaces the output segment from the pitch
pre-processing if it provides a better match to the weighted target signal given by Equation
50, i.e.,



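$s'_w(n) = \begin{cases} v_c(n) & \text{if } \sum_{n=0}^{N_s-1} v_c(n) \cdot s_w^{wt}(n) > \sum_{n=0}^{N_s-1} s'_w(n) \cdot s_w^{wt}(n) \\ s'_w(n) & \text{otherwise} \end{cases}$ (Equation 62)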



In addition, the replacement is also contingent upon the absolute match $R_n(\tau_{opt})$ of the
pitch pre-processing. Preferably, the candidate from the waveform interpolation is
accepted if the pitch pre-processing fails and the candidate provides a good match to the
target signal.

8. Pitch Gain and Pitch Correlation Estimation. 

The pitch gain and pitch correlation are available on a pitch cycle basis. The pitch
gain is estimated in order to minimize the mean squared error between the target $s_w^t(n)$,
Equation 48, and the final modified signal $s'_w(n)$, Equation 62, and is given by Equation
63.



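$g_a = \frac{\sum_{n=0}^{N_s-1} s'_w(n) \cdot s_w^t(n)}{\sum_{n=0}^{N_s-1} s_w^t(n)^2}$ (Equation 63)

The pitch correlation is given by Equation 64.

$R_a = \frac{\sum_{n=0}^{N_s-1} s'_w(n) \cdot s_w^t(n)}{\sqrt{\left(\sum_{n=0}^{N_s-1} s'_w(n)^2\right) \cdot \left(\sum_{n=0}^{N_s-1} s_w^t(n)^2\right)}}$ (Equation 64)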



Both parameters are available on a pitch cycle basis and are linearly interpolated in order 
to estimate the parameters at the regular three subframes per frame. 

9. Refine Signal Characteristic Classification. 

Based on the average pitch correlation and pitch gains estimated during pitch
pre-processing the Class 6, "Stationary Voiced" is introduced. Furthermore, based on a
refined noise-like unvoiced detection the Class 1, "Noise-Like Unvoiced Speech," is
distinguished. This completes the signal characteristic classification.

A. Mode Selection. 

The mode selection is controlled by the signal characteristic classification. If the
frame is classified as "Stationary Voiced," Class 6, the frame is encoded using Mode 1.
For Class 0 through 5, the frame is encoded using Mode 0. The mode information is
added to the bit-stream and transmitted to the decoder.
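The mode rule is simple enough to state as a short sketch. The constant and function names are hypothetical; the class numbers follow the signal characteristic classification listed earlier.

```python
STATIONARY_VOICED = 6  # class number from the signal characteristic classification


def select_mode(signal_class):
    """Return 1 for "Stationary Voiced" frames and 0 for classes 0 through 5."""
    if signal_class not in range(0, 7):
        raise ValueError("signal class must be in 0..6")
    return 1 if signal_class == STATIONARY_VOICED else 0
```

Since the mode is transmitted in the bit-stream, the decoder does not need to repeat the classification.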

The two modes are referred to as suitable for "non-periodic"-like and "periodic"-like
frames. However, this labeling should be interpreted with some care. The frames
encoded using Mode 1 are those maintaining a high pitch correlation and high pitch gain
throughout the frame based on the pitch track derived from only 7 bits per frame.
Consequently, the selection of Mode 0 rather than Mode 1 can be due to an inaccurate
representation of the pitch track with only 7 bits, and not necessarily due to the absence
of periodicity. Hence, signals encoded with Mode 0 may contain periodicity, though not
well represented by only 7 bits per frame for the pitch track. Therefore, Mode 0 encodes
the pitch track with 7 bits twice per frame (14 bits total per frame) in order to represent
the pitch track properly.

10. Mode 0 Processing.

If the mode selection in Fig. 3 dictates Mode 0, the encoding proceeds
according to the mode optimized for "non-periodic"-like signals. A block diagram of the
Mode 0 processing (subsequent to the processing of Fig. 3) is presented in Fig. 4. This
mode is similar to the traditional CELP encoding of G.729. In Mode 0, the frame is
divided into two subframes. All functions in the block diagram are executed on a
subframe basis.

A pitch track is provided to an adaptive codebook 410 as shown in Fig. 4. A
code-vector, shown as $v_a$, is provided from the adaptive codebook 410. After passing
through a gain stage, it is fed into a synthesis filter 412. The output of the synthesis filter
412 is passed through a perceptual weighting filter 414 that generates an output that is
received by a first summing junction. The first summing junction also receives an input
from a modified weighted speech. The modified weighted speech is also received by an
analysis of energy evolution block 450 and an energy processing block 460. The energy
processing block 460 comprises an energy normalization block 462, an energy smoothing
block 464, and a generate energy-correct target block 466. The output of the first
summing junction is fed to a minimization block 411 that generates an output used to
modify selection within the adaptive codebook 410. That output is also fed to a second
summing junction.

A fixed codebook 420 provides a code-vector, shown as $v_c$, to a gain stage that
generates an output received by a synthesis filter 422. The output of the synthesis filter
422 is passed through a perceptual weighting filter 424 before being received by a second
summing junction. The output of the second summing junction is fed to a minimization
block 421 that generates an output used to modify selection within the fixed codebook
420. Control information is also provided to the minimization block 421.

In addition, a two dimensional vector quantization (2D VQ) gain codebook 470
provides input to two gain stages, and the outputs from those gain stages are passed to a
synthesis filter 472 after being combined at a third summing junction. The output of the
synthesis filter 472 is passed through a perceptual weighting filter 474 before being
received by a fourth summing junction that receives input from the energy processing
block 460 via a modified target signal. Control information and the code-vectors $v_a$ and
$v_c$ are used to generate the modified target signal. The output from the fourth summing
junction is received by a minimization block 471 that generates an output received by the
2D VQ gain codebook 470.

A. Adaptive Codebook Search.

The contribution from the adaptive codebook (the past excitation) is specified
with 7 bits. The 7 bits represent a delay from 17 to 127 samples. The delay (pitch
period) is non-uniformly distributed and includes fractional delays between about 17 and
40 samples, and only integer delays above about 40 samples.

Initially, the integer lag from the open loop pitch estimation is refined. The
search minimizes the weighted mean-squared error (WMSE) between the original and
reconstructed speech. The cross-correlation function is searched within a range of three
samples of the open loop pitch estimate according to Equation 65,

$$L_p^I = \arg\max\{R(l),\ l = L_p - 3, \ldots, L_p + 3\}$$ (Equation 65)

where $L_p$ is the open loop pitch estimate, and $L_p^I$ is the refined integer pitch lag estimate.
The cross-correlation function $R(l)$ is expressed by Equation 66,

$$R(l) = \sum_{n=0}^{79} t(n) \cdot (e(n-l) * h(n))$$ (Equation 66)

where $t(n)$ is the target signal, $e(n)$ is the excitation (the adaptive codebook), and $h(n)$ is
the perceptually weighted impulse response of the LPC synthesis filter. The relationship
between the excitation $e(n)$ and the vector from the adaptive codebook $v_a$ can be
expressed as:

$$e(n - l) = v_a^{idx(l)}(n)$$ (Equation 67)

where the function $idx(l)$ maps the delay/lag $l$ to the proper index.
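By way of illustration only, the integer lag refinement of Equations 65 and 66 may be sketched as follows. The function names, the handling of lags shorter than the subframe (periodic repetition), and the clamping of candidates to the 17 to 127 sample range are assumptions of this sketch and are not taken from the specification.

```python
def convolve_truncated(x, h):
    """Return the first len(x) samples of the linear convolution x * h."""
    return [sum(h[k] * x[n - k] for k in range(min(n + 1, len(h))))
            for n in range(len(x))]


def delayed_excitation(past_exc, lag, n_sub=80):
    """Build e(n - lag) for n = 0..n_sub-1 from the past excitation.

    Assumption: when lag < n_sub, samples that would fall inside the
    current subframe are filled by repeating the vector with period lag.
    """
    v = []
    for n in range(n_sub):
        if n < lag:
            v.append(past_exc[len(past_exc) - lag + n])
        else:
            v.append(v[n - lag])
    return v


def cross_correlation(target, past_exc, h, lag):
    """R(l) of Equation 66 for a single candidate lag."""
    filtered = convolve_truncated(
        delayed_excitation(past_exc, lag, len(target)), h)
    return sum(t * y for t, y in zip(target, filtered))


def refine_integer_lag(target, past_exc, h, open_loop_lag,
                       span=3, lo=17, hi=127):
    """Equation 65: best lag within +/- span of the open loop estimate."""
    candidates = range(max(lo, open_loop_lag - span),
                       min(hi, open_loop_lag + span) + 1)
    return max(candidates,
               key=lambda lag: cross_correlation(target, past_exc, h, lag))
```

For example, with a past excitation holding a single pulse 50 samples before the subframe start and an open loop estimate of 52, the search examines lags 49 through 55 and returns 50.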

The final pitch lag (adaptive codebook contribution) is determined by searching
the entries in the adaptive codebook that correspond to lags that are within one sample of
the refined integer lag. This may or may not include fractional lags depending on the
value of the refined integer lag. The cross-correlation function given by Equation 66 is
interpolated and low-pass filtered using a 13th order Hamming weighted Sinc window to
provide the cross-correlation at the relevant lags.

The optimal WMSE pitch gain is estimated according to Equation 68,

$$g = \frac{\sum_{n=0}^{79} t(n) \cdot (e(n - L_p^{opt}) * h(n))}{\sum_{n=0}^{79} (e(n - L_p^{opt}) * h(n))^2}, \quad \text{bounded by } 0.0 \le g \le 1.2$$ (Equation 68)

where $L_p^{opt}$ is the final pitch lag calculated to minimize the WMSE between the original
speech signal and the reconstructed speech signal. The unquantized pitch gain is
calculated according to the following weighting of the optimal pitch gain expressed in
Equation 69,

$$g_a = \left(\tfrac{1}{2} R_n + \tfrac{1}{2}\right) \cdot g$$ (Equation 69)

where the normalized cross-correlation is given by Equation 70.

$$R_n = \frac{\sum_{n=0}^{79} t(n) \cdot (e(n - L_p^{opt}) * h(n))}{\sqrt{\left(\sum_{n=0}^{79} t(n)^2\right) \cdot \left(\sum_{n=0}^{79} (e(n - L_p^{opt}) * h(n))^2\right)}}$$ (Equation 70)

This weighting de-emphasizes the pitch contribution from the adaptive codebook
prior to the fixed codebook search, leaving more of the pitch information in the target
signal for the fixed codebook search.
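The gain computation of Equations 68 through 70 may be sketched as follows, assuming the weighting of Equation 69 is read as (R_n/2 + 1/2) times the optimal gain. The argument `filtered_exc` stands for the already filtered adaptive codebook vector; that name and the zero-energy guards are illustrative assumptions.

```python
import math


def dot(x, y):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))


def optimal_pitch_gain(target, filtered_exc, g_max=1.2):
    """Equation 68: WMSE-optimal gain, bounded to [0.0, g_max]."""
    energy = dot(filtered_exc, filtered_exc)
    if energy == 0.0:          # guard added for this sketch
        return 0.0
    return min(max(dot(target, filtered_exc) / energy, 0.0), g_max)


def normalized_correlation(target, filtered_exc):
    """Equation 70: normalized cross-correlation R_n."""
    denom = math.sqrt(dot(target, target) * dot(filtered_exc, filtered_exc))
    return dot(target, filtered_exc) / denom if denom > 0.0 else 0.0


def unquantized_pitch_gain(target, filtered_exc):
    """Equation 69 as read here: g_a = (R_n / 2 + 1 / 2) * g."""
    g = optimal_pitch_gain(target, filtered_exc)
    r_n = normalized_correlation(target, filtered_exc)
    return (0.5 * r_n + 0.5) * g
```

With a perfectly matched target (R_n = 1) the weighting leaves the gain unchanged; as the correlation falls toward zero the gain is reduced toward half of its optimal value, which is the de-emphasis described above.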



B. Fixed Codebook Search.

The fixed codebook excitation is represented by 15 bits in Mode 0. The codebook
has three sub codebooks, where two are pulse codebooks and the third is a Gaussian
codebook. The 2-pulse codebook has 16384 entries, the 3-pulse codebook has 8192
entries, and the Gaussian codebook has 8192 entries. This adds up to a total of 32768
entries, equivalent to 15 bits. Weighting of the WMSE from the different sub codebooks
is applied in order to favor the excitation most suitable from a perceptual point of view.

The initial target for the fixed codebook is calculated from the weighted pre-
processed speech with the zero-response removed (i.e., the target for the adaptive
codebook) and from the optimal adaptive codebook excitation and gain, according
to Equation 71.

$$t'(n) = t(n) - g_a \cdot (e(n - L_p^{opt}) * h(n))$$ (Equation 71)

The perceptual weighting for the search of the fixed codebook is adapted
according to the instantaneous Noise to Signal Ratio (NSR) by an adaptive 1st order
filter. When the NSR is above -2.5 dB (the signal is less than 2.5 dB above the noise
floor) a 1st order filter is added to the regular perceptual weighting filter. This additional
weighting filter is introduced by filtering both the target $t'(n)$ and the LPC synthesis filter
response $h(n)$ prior to the codebook search. The 1st order filter is preferably defined by
Equation 72,

$$H'_w(z) = \frac{1}{1 - \eta z^{-1}}$$ (Equation 72)

where the filter coefficient $\eta$ is calculated as follows:

$$\eta = -0.25 \cdot \frac{\sum_{n=1}^{30} [\ldots]}{\sum^{30} [\ldots]^2}$$ (Equation 73)

Preferably, an objective of the filter of Equation 72 is to provide slightly better
matching of the high frequency energy in high-level background noise segments. The
resulting target and synthesis filter response are denoted $t''(n)$ and $h''(n)$, respectively.
When the signal is more than 2.5 dB above the noise floor no additional weighting is
applied, i.e., $t''(n) = t'(n)$ and $h''(n) = h(n)$.
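The conditional weighting may be sketched as follows, assuming Equation 72 denotes the one-pole filter 1/(1 - eta z^-1), whose difference equation is y(n) = x(n) + eta * y(n-1). The coefficient `eta` is taken as an input here because its computation (Equation 73) is not reproduced; the function names are illustrative.

```python
def one_pole_filter(x, eta):
    """Apply H(z) = 1 / (1 - eta * z^-1): y(n) = x(n) + eta * y(n-1)."""
    y = []
    prev = 0.0
    for sample in x:
        prev = sample + eta * prev
        y.append(prev)
    return y


def adapt_weighting(target, h, nsr_db, eta, threshold_db=-2.5):
    """Return (t'', h'') for the fixed codebook search.

    Above the NSR threshold both the target and the synthesis filter
    response are filtered; otherwise they are passed through unchanged.
    """
    if nsr_db > threshold_db:
        return one_pole_filter(target, eta), one_pole_filter(h, eta)
    return list(target), list(h)
```

Filtering both the target and the impulse response with the same filter keeps the search criterion consistent, since the candidate excitation is compared with the target only after passing through the modified response.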

Prior to the search of the three sub codebooks, some characteristics are built into
the excitation of the two pulse sub codebooks to enhance the perceptual quality. This
may be achieved by modifying the filter response of the synthesis filter for the codebook
search. The first characteristic is introduced with a phase dispersion filter that spreads
the pulses of the two pulse codebooks. The filter is preferably fixed and modifies only
the high-frequency phase. The filter is designed in the frequency domain with zero phase
and unity magnitude at frequencies below 2 kHz, and with an appropriate pseudo-random
phase and unity magnitude at frequencies above 2 kHz. The filter may be transformed
into the time domain resulting in the impulse response $h_n(n)$. The phase dispersion is
preferably incorporated into the synthesis filter response for the codebook search
according to Equation 74.

$$h_1(n) = h_n(n) * h''(n)$$ (Equation 74)
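A frequency-domain design of such a phase dispersion filter may be sketched as follows. The 8 kHz sampling rate, the 64-point transform length, the seeded pseudo-random generator, and the treatment of the Nyquist bin are all assumptions of this sketch; the specification fixes only the 2 kHz boundary and the unity magnitude.

```python
import cmath
import math
import random


def phase_dispersion_response(n_fft=64, fs=8000.0, cutoff=2000.0, seed=0):
    """Impulse response with unity magnitude at every frequency bin,
    zero phase below `cutoff`, and pseudo-random phase at or above it."""
    rng = random.Random(seed)          # fixed seed, so the filter is fixed
    half = n_fft // 2
    spectrum = [0j] * n_fft
    for k in range(half + 1):
        freq = k * fs / n_fft
        if freq < cutoff or k == half:
            phase = 0.0                # Nyquist bin kept real
        else:
            phase = rng.uniform(-math.pi, math.pi)
        spectrum[k] = cmath.exp(1j * phase)
        if 0 < k < half:
            # Conjugate symmetry yields a real impulse response.
            spectrum[n_fft - k] = spectrum[k].conjugate()
    # Direct inverse DFT (adequate for this short length).
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * n / n_fft)
            for k in range(n_fft)).real / n_fft
        for n in range(n_fft)
    ]
```

Because every bin has unit magnitude, the response has unit energy, so convolving it with the synthesis filter response as in Equation 74 spreads each pulse in time without changing the magnitude spectrum.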

Second, for pitch lags greater than a subframe size (80 samples) the traditional
pitch enhancement does not contribute to the fixed codebook excitation. To compensate
for the resulting sparseness for high pitch lags, 2 correlation factors of delays less than the
subframe size are adaptively inserted by modifying the response of the synthesis filter for
the codebook search. The inserted enhancement increases the density of the fixed
codebook excitation. The strength and relative delay of the 2 correlation factors are
estimated based on a WMSE between the quantized weighted speech of a past subframe
and a segment delayed further by between about 3 and 79 samples. The current subframe
incorporates the two most significant correlation factors of the most recent past. Since
the estimation is based on the past, the decoder is able to perform the identical
operation. The two delays are calculated by Equation 75,

$$L_c = \arg\max\{R(l),\ l = 3, 4, \ldots, 79\}$$ (Equation 75)

where the correlation function is given as follows:

$$R(l) = \left(\sum_{j=0}^{79} s_w(n - 80 + j) \cdot s_w(n - 80 - l + j)\right)$$ (Equation 76)

where $s_w(n)$ is the weighted past quantized speech and $S_w(z)$ is expressed by Equation
77,

$$S_w(z) = \frac{1}{[\ldots]} \cdot E(z)$$ (Equation 77)

where $E(z)$ is the past quantized excitation. This results in the two optimal delays $L_{c1}$ and
$L_{c2}$. The gain for each of the two delays is estimated as a weighting of the normalized
cross-correlation, and the filter is given by Equation 78,

$$h_c(n) = \begin{cases} 1, & n = 0 \\ \beta \cdot \dfrac{\sum_{j=0}^{79} s_w(n - 80 + j) \cdot s_w(n - 80 - L_c + j)}{\sqrt{\sum_{j=0}^{79} s_w(n - 80 - L_c + j)^2 \cdot \sum_{j=0}^{79} s_w(n - 80 + j)^2}}, & n = L_c \\ 0, & \text{otherwise} \end{cases}$$ (Equation 78)

where the weighting factor $\beta$ is 0.25 when the delay is in the vicinity of the pitch lag and
0.5 otherwise. The final response of the synthesis filter for the search of the two pulse
codebooks can be expressed by Equation 79.

$$h_2(n) = h_{c2} * (h_{c1} * h_1(n))$$ (Equation 79)

The modifications of the excitation described by Equations 74 and 79 are only
done for the two pulse sub codebooks, and not for the Gaussian sub codebook.
Therefore, for the search of the Gaussian sub codebook the unmodified response of the
synthesis filter $h''(n)$ is used.

C. 2-Pulse Codebook.

The 2-pulse codebook is searched in a closed-loop to minimize the perceptually
weighted mean-squared error between the original and the reconstructed speech. The
MSE between the candidate excitation convolved with the weighted response of the
synthesis filter given by Equation 79 and the target $t''(n)$ is minimized according to
Equation 80,

$$c_{2P}(n) = \arg\max\left\{\frac{\left(\sum_{n=0}^{79} t''(n) \cdot (c_{2P}^l(n) * h_2(n))\right)^2}{\sum_{n=0}^{79} (c_{2P}^l(n) * h_2(n))^2},\ c_{2P}^l(n) \in \{c_{2P}^l(n),\ l = 0, \ldots, 16383\}\right\}$$ (Equation 80)

where $c_{2P}^l(n)$, $l = 0, \ldots, 16383$ are the candidate excitation vectors from the 2-pulse codebook,
and $c_{2P}(n)$ is the best candidate. Each pulse is restricted to a track where 6 bits specify
the position in the track, and 1 bit specifies the sign of the pulse. This is equivalent to a
total of 14 bits (16384 entries). The two tracks may be constructed from the following 5
sub tracks:

T_0: {0,5,10,15,20,25,30,35,40,45,50,55,60,65,70,75}
T_1: {1,6,11,16,21,26,31,36,41,46,51,56,61,66,71,76}
T_2: {2,7,12,17,22,27,32,37,42,47,52,57,62,67,72,77}
T_3: {3,8,13,18,23,28,33,38,43,48,53,58,63,68,73,78}
T_4: {4,9,14,19,24,29,34,39,44,49,54,59,64,69,74,79}

The tracks for the 2 pulses may be given by:

T_p1: T_0 ∪ T_1 ∪ T_2 ∪ T_3
T_p2: T_1 ∪ T_2 ∪ T_3 ∪ T_4

where each track has 64 pulse positions (6 bits).
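The track construction and the search criterion of Equation 80 may be sketched as follows. This sketch performs an exhaustive search and omits the pitch enhancement and the reduced complexity search described in the specification; the function names and the use of a generic impulse response `h` in place of the modified response are assumptions made for illustration.

```python
N_SUB = 80

# Sub tracks T_0 .. T_4: every fifth position starting at 0 .. 4.
SUB_TRACKS = [list(range(start, N_SUB, 5)) for start in range(5)]
TRACK_P1 = sorted(SUB_TRACKS[0] + SUB_TRACKS[1] + SUB_TRACKS[2] + SUB_TRACKS[3])
TRACK_P2 = sorted(SUB_TRACKS[1] + SUB_TRACKS[2] + SUB_TRACKS[3] + SUB_TRACKS[4])


def filtered_pulse(h, pos):
    """Unit pulse at `pos` convolved with h, truncated to the subframe."""
    y = [0.0] * N_SUB
    for k, hk in enumerate(h):
        if pos + k < N_SUB:
            y[pos + k] = hk
    return y


def search_two_pulse(target, h):
    """Exhaustively maximize (sum t*y)^2 / sum y^2 over both tracks and signs.

    Returns (position 1, sign 1, position 2, sign 2).
    """
    filt = [filtered_pulse(h, p) for p in range(N_SUB)]
    corr = [sum(t * y for t, y in zip(target, f)) for f in filt]
    energy = [sum(y * y for y in f) for f in filt]
    best, best_score = None, -1.0
    for p1 in TRACK_P1:
        for p2 in TRACK_P2:
            cross = sum(a * b for a, b in zip(filt[p1], filt[p2]))
            for s1 in (1.0, -1.0):
                for s2 in (1.0, -1.0):
                    den = energy[p1] + energy[p2] + 2.0 * s1 * s2 * cross
                    if den <= 1e-12:   # pulses cancel; skip this candidate
                        continue
                    score = (s1 * corr[p1] + s2 * corr[p2]) ** 2 / den
                    if score > best_score:
                        best, best_score = (p1, s1, p2, s2), score
    return best
```

Each track holds 64 positions and each pulse carries one sign bit, so the two loops visit (64 · 2) · (64 · 2) = 16384 candidates, matching the 14-bit index. The criterion is invariant to a common sign flip of both pulses, so the sketch returns the first of two equivalent sign combinations.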

Pitch enhancement is applied to the 2-pulse codebook in both the forward and
backward directions. This concept is illustrated in Fig. 5. The forward-backward
pitch enhancement is specified with the lag $L_{pe}$ and gain $g_{pe}$ given by Equations 81 and 82,

(Equation 81)

and

(Equation 82)

where $L_p$ is the integer part of the pitch lag. It is incorporated into the pulses
$c_{2P}^l(n)$, $l = 0, \ldots, 16383$ when searching the codebook according to Equation 80. Preferably, a
reduced complexity search is applied to maintain low complexity.

D. 3-Pulse Codebook.

The 3-pulse codebook is searched in a closed-loop to minimize the perceptually
weighted mean-squared error between the original and the reconstructed speech. The
MSE between the candidate excitation convolved with the weighted response of the
synthesis filter given by Equation 79 and the target $t''(n)$ is minimized according to
Equation 83,

$$c_{3P}(n) = \arg\max\left\{\frac{\left(\sum_{n=0}^{79} t''(n) \cdot (c_{3P}^l(n) * h_2(n))\right)^2}{\sum_{n=0}^{79} (c_{3P}^l(n) * h_2(n))^2},\ c_{3P}^l(n) \in \{c_{3P}^l(n),\ l = 0, \ldots, 8191\}\right\}$$ (Equation 83)

where $c_{3P}^l(n)$, $l = 0, \ldots, 8191$ are the candidate excitation vectors from the 3-pulse codebook,
and $c_{3P}(n)$ is the best candidate. The 3-pulse codebook is constructed by a specification
of an absolute point by 4 bits (from a track of 16 positions) and the position of each of the
3 pulses relative to the absolute point with 2 bits and 1 bit for the sign. This results in
4 + 3 · (2 + 1) = 13 bits or 8192 entries. The track for the absolute point is expressed below:

T_abs: {0,4,8,12,16,20,24,28,33,38,43,48,53,58,63,68},

and the relative tracks for the 3 pulses are:

ΔT_p1: {0,3,6,9}
ΔT_p2: {1,4,7,10}
ΔT_p3: {2,5,8,11}
Pitch enhancement is applied to the 3-pulse codebook using forward-backward pitch
enhancement as illustrated in Fig. 5. The parameters are expressed in Equations 84 and
85,

$$L_{pe} = \begin{cases} L_p, & L_p < 80 \\ \tfrac{1}{2} L_p, & L_p \ge 80 \end{cases}$$ (Equation 84)

and

$$g_{pe} = \begin{cases} 0.250, & L_p < 80 \\ 0.125, & L_p \ge 80 \end{cases}$$ (Equation 85)

where $L_p$ is the integer part of the pitch lag. The pitch enhancement is incorporated into
the pulses $c_{3P}^l(n)$, $l = 0, \ldots, 8191$ when searching the codebook according to Equation 83.
Preferably, a reduced complexity search is applied to maintain low complexity.

E. Gaussian Codebook.

The Gaussian codebook is searched in a closed-loop to minimize the perceptually
weighted mean-squared error between the original and the reconstructed speech. The
Gaussian excitation vector is constructed from two orthogonal basis vectors, where the
first basis vector contains all the even sample points in the subframe, and the second basis
vector contains all the odd sample points. Each basis vector has a dimension of 40, and
the same pseudo Gaussian codebook is used for the two basis vectors. For each basis
vector, 45 candidates are considered and 1 bit is used to specify the sign. This results in
(45 · 2) · (45 · 2) = 8100 entries, which are specified by 13 bits. The remaining 92 entries are
not used.

In order to reduce the complexity, two candidates are pre-selected in open-loop
for each basis vector by maximizing the cross correlation functions $|R_1(l)|$ and $|R_2(l)|$,
respectively, as seen in Equations 86 and 87,

$$R_1(l) = \sum_{n=0}^{39} e_2(2 \cdot n) \cdot c_{Ga}^l(n), \quad l = 0, \ldots, 44$$ (Equation 86)

$$R_2(l) = \sum_{n=0}^{39} e_2(2 \cdot n + 1) \cdot c_{Ga}^l(n), \quad l = 0, \ldots, 44$$ (Equation 87)

where $e_2(n)$ is the residual after removal of the contribution from the adaptive codebook,
and $c_{Ga}^l(n)$, $l = 0, \ldots, 44$ are the candidate excitation vectors from the Gaussian codebook (the
same for both basis vectors). The signs of the candidate vectors $\mathrm{sgn}(R_1(l))$ and $\mathrm{sgn}(R_2(l))$
are determined as the sign of the correlation functions $R_1(l)$ and $R_2(l)$ for the respective
candidate vectors. The final candidate is determined among the four possible
combinations of the pre-selected candidates for the two basis vectors by maximizing
Equation 88,

$$R(l_1, l_2) = \frac{\left(\sum_{n=0}^{79} t''(n) \cdot (c_{Ga}^{l_1,l_2}(n) * h''(n))\right)^2}{\sum_{n=0}^{79} (c_{Ga}^{l_1,l_2}(n) * h''(n))^2}, \quad l_1 = L_1(0), L_1(1),\ l_2 = L_2(0), L_2(1)$$ (Equation 88)

where $L_1(0), L_1(1)$ and $L_2(0), L_2(1)$ specify the candidate vectors for the two basis
vectors. The Gaussian code vector is reconstructed according to Equation 89.

$$c_{Ga}(n) = \begin{cases} \mathrm{sgn}(R_1(l_1)) \cdot c_{Ga}^{l_1}\!\left(\tfrac{n}{2}\right), & n \text{ even} \\ \mathrm{sgn}(R_2(l_2)) \cdot c_{Ga}^{l_2}\!\left(\tfrac{n-1}{2}\right), & n \text{ odd} \end{cases} \quad n = 0, \ldots, 79$$ (Equation 89)

No pitch enhancement or enrichment of the excitation as specified by Equations 74 and
79, respectively, is performed for the Gaussian sub codebook.
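The open-loop pre-selection and the interleaved reconstruction of the Gaussian code vector may be sketched as follows. The sketch assumes the even and odd samples of the residual have already been separated, and the small codebook used in the check is purely illustrative; the function names are not from the specification.

```python
def correlations(residual_phase, codebook):
    """Cross-correlation of one residual phase (even or odd samples)
    with every candidate basis vector."""
    return [sum(r * c for r, c in zip(residual_phase, cand))
            for cand in codebook]


def preselect(residual_phase, codebook, keep=2):
    """Keep the `keep` candidates with the largest |R(l)|.

    Returns (index, sign) pairs, the sign being that of R(l).
    """
    corr = correlations(residual_phase, codebook)
    ranked = sorted(range(len(codebook)),
                    key=lambda l: abs(corr[l]), reverse=True)
    return [(l, 1.0 if corr[l] >= 0.0 else -1.0) for l in ranked[:keep]]


def build_gaussian_vector(codebook, l1, s1, l2, s2):
    """Interleave two signed basis vectors: even samples come from
    candidate l1, odd samples from candidate l2."""
    out = []
    for even_sample, odd_sample in zip(codebook[l1], codebook[l2]):
        out.append(s1 * even_sample)
        out.append(s2 * odd_sample)
    return out
```

With 45 candidates and one sign bit per basis vector, the two basis vectors span (45 · 2)² = 8100 combinations, while the closed-loop stage only has to compare the 2 · 2 = 4 pre-selected combinations.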

F. Final Selection.

The selection of the final fixed codebook excitation involves comparing the
WMSE of the best candidate from each of the three sub codebooks after applying
appropriate weighting according to the classification information. The modifications of
the excitation described by Equations 74 and 79 and the forward-backward pitch
enhancement are incorporated into the excitation when appropriate, i.e., if the final fixed
codebook excitation is selected from either of the two pulse codebooks. The final
fixed codebook excitation is denoted $v_c$ or $v_c(n)$.

G. Analysis of Energy Evolution.

The "Analysis of Energy Evolution" distinguishes segments of stationary
background noise from segments of speech, music, tonal-like signals, non-stationary
noise, etc., to control the amount of energy smoothing. The energy smoothing factor
$\beta_E$ is based on the detection of subframes of stationary background noise. The
classification may appear as a regular VAD; however, the objective of the classification is
distinctly different from the VAD. While the VAD is optimized so that speech is not
misclassified, the detection of stationary background noise is optimized so that
stationary background noise is not misclassified. Occasional misclassification of
border cases of "non-stationary background noise" causes only minor degradation.
Furthermore, the detection of stationary background noise is subframe based, and thus
has slightly improved time resolution. Consequently, the detection of stationary
background noise is significantly different from the VAD, and neither is capable of being
substituted for the other.

The detection of stationary background noise is performed on a subframe basis
and takes place in two steps. Initially, a detection based on the pitch pre-processed
speech occurs. Next, the detection is refined using the residual signal after the adaptive
codebook contribution is removed.

H. Initial Detection.

The initial detection is based on the pitch pre-processed speech, the VAD
information, the pitch lag, and the 1st reflection coefficient (representing the tilt). Based
on these parameters, Equations 90 through 97 are solved. The pitch correlation can be
expressed as Equation 90.

$$R_p = \frac{\sum_{n=0}^{79} s(n) \cdot s(n - L_p)}{\sqrt{\left(\sum_{n=0}^{79} s(n)^2\right) \cdot \left(\sum_{n=0}^{79} s(n - L_p)^2\right)}}$$ (Equation 90)

The running mean of the pitch correlation can be expressed as Equation 91.

$$\langle R_p(n) \rangle = 0.9 \cdot \langle R_p(n - 1) \rangle + 0.1 \cdot R_p$$ (Equation 91)

The maximum absolute signal amplitude in the current pitch cycle can be expressed as
Equation 92,

$$\chi(k) = \max\{|s(n)|,\ n \in C_{pc}\}$$ (Equation 92)

where $C_{pc}$ is the set of samples belonging to the current pitch cycle. The accumulated
absolute signal amplitudes can be expressed as Equation 93.

$$\psi(k) = \sum_{n \in C_{pc}} |s(n)|$$ (Equation 93)

The Signal to Noise Ratio of Maximum (SNRM) can be expressed as Equation 94,

$$SNR_\chi(k) = \frac{\chi(k)}{\langle \chi_N(k) \rangle}$$ (Equation 94)

where $\langle \chi_N(k) \rangle$ is a running mean of the maximum absolute signal amplitude of
subframes that are predominantly stationary background noise. The absolute signal
maximum in groups of 3 subframes can be expressed as Equation 95.

$$\chi_{g3}(k - j) = \max\{\chi(k - 3 \cdot j - 2),\ \chi(k - 3 \cdot j - 1),\ \chi(k - 3 \cdot j)\}, \quad j = 4, 3, \ldots, 0$$ (Equation 95)

The steepest maximum evolution can be expressed as Equation 96.

5y * = z ^ k) T (Equation 96) 

The linear maximum evolution (slope of MSE-fit to maximum in groups of 3 subframes)
can be expressed as Equation 97.

$$\partial_{lin}\chi_{g3}(k) = 0.1 \cdot \sum_{j=-4}^{0} \chi_{g3}(k + j) \cdot (j + 2)$$ (Equation 97)

Based on the parameters given by Equations 91 and 94 and the VAD information, the
stationary background noise is detected. Furthermore, functionality to detect long-term
decrease or increase in the background noise level and perform appropriate resets is
based on the parameters calculated by Equations 90, 91, 94, 96 and 97. Finally, the
update of the running mean of the maximum absolute signal amplitude of subframes that
are predominantly stationary background noise is controlled by the parameters given by
Equations 90, 91, and 94 and the reset information.
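The grouped maxima of Equation 95, the least-squares slope of Equation 97, and the running mean of Equation 91 may be sketched as follows. The history layout (most recent subframe last, at least 15 entries) and the function names are assumptions of this sketch.

```python
def group_maxima(chi_history, groups=5, size=3):
    """Equation 95: maxima over consecutive groups of 3 subframes.

    chi_history holds per-subframe maxima, most recent last, and must
    contain at least groups * size entries. Oldest group comes first.
    """
    recent = chi_history[-groups * size:]
    return [max(recent[i:i + size]) for i in range(0, groups * size, size)]


def linear_max_evolution(group_max):
    """Equation 97: 0.1 * sum over j = -4..0 of chi_g3(k + j) * (j + 2).

    With abscissae -2..2 the sum of squares is 10, so the factor 0.1
    makes this the least-squares slope through the five group maxima.
    """
    return 0.1 * sum(g * (j + 2) for j, g in zip(range(-4, 1), group_max))


def running_mean(previous, current, alpha=0.9):
    """Equation 91: first-order recursive average."""
    return alpha * previous + (1.0 - alpha) * current
```

For a history whose group maxima rise by 3 per group, `linear_max_evolution` returns a slope of 3 per group, which is the kind of long-term trend the reset logic is described as monitoring.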

I. Refined Detection.

The refined detection is based on the parameters given by Equations 90 and 91,
and the SNRM of Equation 94 with the exception that the refined detection is based on
the pitch residual rather than the pre-processed speech and the initial detection.

J. Energy Smoothing Factor.

The energy smoothing factor $\beta_E$ is determined based on the refined detection process
outlined below.

1. At the beginning of stationary background noise segments, the smoothing
factor is preferably ramped quadratically from 0.0 to 0.7 over 4 subframes.

2. During stationary background noise segments the smoothing factor
preferably is 0.7.

3. At the end of stationary background noise segments the smoothing factor
is reduced to preferably 0.0, preferably instantaneously.

4. During non-"stationary background noise" segments the smoothing factor
is preferably 0.0.

It should be noted that although the energy smoothing is not performed during the Mode
1 operation, the energy smoothing factor module may still be executed to keep memories
current.
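The four rules above may be sketched as a small state machine. The exact ramp shape is not given beyond "quadratic" over 4 subframes, so the form 0.7 · (i/4)² used here, which reaches 0.7 on the fourth noise subframe, is an assumption of this sketch.

```python
def smoothing_factors(noise_flags, beta_max=0.7, ramp_len=4):
    """Map per-subframe stationary-noise decisions to smoothing factors.

    noise_flags: iterable of booleans, True for subframes classified as
    stationary background noise.
    """
    betas = []
    run = 0                      # consecutive noise subframes so far
    for is_noise in noise_flags:
        run = run + 1 if is_noise else 0
        if run == 0:
            betas.append(0.0)    # rules 3 and 4: immediate drop to zero
        elif run < ramp_len:
            betas.append(beta_max * (run / ramp_len) ** 2)   # rule 1
        else:
            betas.append(beta_max)                           # rule 2
    return betas
```

Because the factor falls to zero as soon as a subframe is not classified as stationary noise, smoothing cannot leak into speech onsets, while the gradual ramp avoids an abrupt change in energy tracking when a noise segment begins.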

K. Energy Normalization, Smoothing, and Correction.

This module modifies the target signal prior to the gain quantization to maintain
the energy contour (smoothness) of noise-like segments and avoid the typical "waterfall"
effect of CELP coding at especially low bit-rates. Preferably, the energy smoothing is
directed towards segments of stationary background noise. The process estimates the
unquantized gains, the smoothed open-loop and closed loop energies, the normalized
gains and the new target signal for gain quantization.

L. Estimation of Unquantized Gains.

The unquantized gains are estimated in one of two ways depending on the
normalized pitch correlation given by Equation 70. If the normalized pitch correlation is
below approximately 0.6, the adaptive and fixed codebook gains are optimized jointly by
minimizing the WMSE between the original and reconstructed speech according to Equation
98.

$$\{g_a, g_c\} = \arg\min\left\{\sum_{n=0}^{79} \left(t(n) - \left((g_a v_a(n) * h(n)) + (g_c v_c(n) * h(n))\right)\right)^2\right\}$$ (Equation 98)

This results in the following estimates of the two gains:

$$g_a = \frac{R_{a,t} \cdot R_{c,c} - R_{a,c} \cdot R_{c,t}}{R_{a,a} \cdot R_{c,c} - R_{a,c} \cdot R_{a,c}}$$ (Equation 99)

and

$$g_c = \frac{R_{c,t} - g_a \cdot R_{a,c}}{R_{c,c}}$$ (Equation 100)

where

$$R_{a,t} = \sum_{n=0}^{79} (v_a(n) * h(n)) \cdot t(n), \quad R_{c,t} = \sum_{n=0}^{79} (v_c(n) * h(n)) \cdot t(n), \quad R_{a,a} = \sum_{n=0}^{79} (v_a(n) * h(n))^2,$$

$$R_{c,c} = \sum_{n=0}^{79} (v_c(n) * h(n))^2, \quad R_{a,c} = \sum_{n=0}^{79} (v_a(n) * h(n)) \cdot (v_c(n) * h(n)).$$ (Equation 101)

If the normalized pitch correlation is above approximately 0.6, the adaptive and fixed
codebook gains are disjointly optimized according to the WMSE between the original
and reconstructed speech. For the adaptive codebook gain only the reconstruction from
the adaptive codebook is considered, i.e.,

$$g_a = \arg\min\left\{\sum_{n=0}^{79} \left(t(n) - (g_a v_a(n) * h(n))\right)^2\right\},$$ (Equation 102)

and the gain is given by

$$g_a = \frac{R_{a,t}}{R_{a,a}}.$$ (Equation 103)

In fact, the optimal WMSE pitch gain is already calculated during the adaptive codebook
search, see Equation 68, and no re-calculation is required. Next, the fixed codebook gain
is estimated according to Equation 104,

$$g_c = \arg\min\left\{\sum_{n=0}^{79} \left(t'(n) - (g_c v_c(n) * h(n))\right)^2\right\},$$ (Equation 104)

where

$$t'(n) = t(n) - (g_a v_a(n) * h(n)).$$ (Equation 105)

Preferably, the gain can be expressed as

$$g_c = \frac{\sum_{n=0}^{79} t'(n) \cdot (v_c(n) * h(n))}{\sum_{n=0}^{79} (v_c(n) * h(n))^2}.$$ (Equation 106)

Preferably, an objective of the disjoint optimization for highly voiced subframes is to
prevent "coincidental" correlation between the target and the fixed codebook from artificially
reducing the pitch gain and causing unnatural fluctuations in the pitch gain. The disjoint
optimization may result in a slightly higher WMSE. However, the overall perceptual
quality is improved.
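The two estimation paths may be sketched as follows, taking the already filtered code vectors as inputs. The argument names `ya` and `yc` (the adaptive and fixed codebook vectors after convolution with h(n)) are assumptions of this sketch, as is the omission of any guard against a singular joint system.

```python
def dot(x, y):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))


def estimate_gains(target, ya, yc, pitch_corr, threshold=0.6, g_max=1.2):
    """Return (g_a, g_c): joint below the correlation threshold,
    disjoint above it."""
    r_at, r_ct = dot(ya, target), dot(yc, target)
    r_aa, r_cc, r_ac = dot(ya, ya), dot(yc, yc), dot(ya, yc)
    if pitch_corr < threshold:
        # Joint optimization: solve the 2x2 normal equations.
        det = r_aa * r_cc - r_ac * r_ac
        g_a = (r_at * r_cc - r_ac * r_ct) / det
        g_c = (r_ct - g_a * r_ac) / r_cc
    else:
        # Disjoint optimization: pitch gain first (bounded as in the
        # adaptive codebook search), then the fixed codebook gain on
        # what remains of the target.
        g_a = min(max(r_at / r_aa, 0.0), g_max)
        residual = [t - g_a * a for t, a in zip(target, ya)]
        g_c = dot(residual, yc) / r_cc
    return g_a, g_c
```

For a target built as 2·ya + 3·yc, the joint path recovers both gains exactly. The disjoint path does not in general, which corresponds to the slightly higher WMSE mentioned above.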

M. Energy Smoothing.

The target energy of both the quantized excitation and reconstructed speech is
estimated according to the smoothing factor $\beta_E$ derived during the analysis of the
energy. The target energy of the quantized excitation (also referred to as the open loop
target energy) is given by Equation 107,

$$E_e(k) = \beta_E \cdot E_e(k - 1) + (1 - \beta_E) \cdot \sum_{n=0}^{79} e(n)^2$$ (Equation 107)

where $e(n)$ is the residual signal. The target energy of the reconstructed speech (also
referred to as the closed loop target energy) is given by Equation 108.

$$E_s(k) = \beta_E \cdot E_s(k - 1) + (1 - \beta_E) \cdot \sum_{n=0}^{79} t(n)^2$$ (Equation 108)
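Both energy targets follow the same first-order recursion, which may be sketched as follows. The function name is illustrative, and the same routine is assumed to serve both the open loop and the closed loop target, applied to the residual and to the target signal respectively.

```python
def smoothed_energy(previous_energy, samples, beta):
    """First-order recursive smoothing of the subframe energy.

    beta = 0.0 gives the instantaneous energy of the current subframe;
    beta = 0.7 weights the previous target energy by 0.7.
    """
    current = sum(s * s for s in samples)
    return beta * previous_energy + (1.0 - beta) * current
```

With a smoothing factor of zero the target simply tracks the current subframe, so the later energy correction then has little to correct outside stationary noise segments.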



N. Energy Normalization.

Based on the smoothed open and closed loop energy targets, an open and a closed
loop scaling factor for the codebook gains are estimated to match the energy targets. It
should be noted that the smoothing is variable and may be zero. The open and closed
loop scaling factors are given by Equations 109 and 110.

$$g_{ol} = 0.7 \cdot [\ldots], \quad \text{bounded by } g_{ol} < \frac{1.2}{g_a}$$ (Equation 109)

$$g_{cl} = [\ldots], \quad \text{bounded by } g_{cl} < \frac{1.2}{g_a}$$ (Equation 110)

Based on the attenuation of the LPC filter given by Equation 111

$$g_{LPC}^{-1} = [\ldots], \quad \text{bounded by } g_{LPC}^{-1} < 0.8$$ (Equation 111)

and the detection of stationary background noise during the analysis of the energy
contour, the final scaling factor $g_{scl}$ is determined as a linear combination of the open and
closed loop scaling factors.

For subframes that are not stationary background noise, the final scaling factor is
estimated according to Equation 112.

$$g_{scl} = \frac{g_{LPC}^{-1}}{0.8} \cdot g_{ol} + \left(1 - \frac{g_{LPC}^{-1}}{0.8}\right) \cdot g_{cl}, \quad \text{bounded by } 1.0 < g_{scl} < (1.0 + g_{LPC}^{-1})$$ (Equation 112)

Hence, when the prediction gain of the LP model is high (having a strong formant
structure) matching of the closed loop energy target is favored, and when the prediction
gain of the LP model is low (having a flat spectral envelope) matching of the open loop
energy target is favored. For stationary background noise subframes, the final scaling
factor is estimated according to

$$g_{scl} = \begin{cases} 1.1 \cdot g_{ol}, & g_{ol} \le g_{cl} \\ 1.1 \cdot g_{cl}, & g_{ol} > g_{cl} \end{cases}$$ (Equation 113)

where a weighting of the smaller of the two scaling factors is selected.

O. Energy Correction.

Based on the final scaling factor, the unquantized gains are modified according to
Equations 114 and 115,

$$g'_a = g_{scl} \cdot g_a$$ (Equation 114)

$$g'_c = g_{scl} \cdot g_c$$ (Equation 115)

and the target is corrected according to Equation 116.

$$t(n) = g'_a \cdot v_a(n) * h(n) + g'_c \cdot v_c(n) * h(n)$$ (Equation 116)

The correction of the target artificially increases the correlation between the target signal
and the filtered excitation vectors to avoid the typical energy fluctuations for waveform
matching (CELP coders) of noise-like signals. This phenomenon may be caused by an
erratic correlation between the target and the filtered excitation vectors caused by a low
bit-rate excitation. It should be noted that, without modifying the target prior to the gain
quantization, the energy normalization, smoothing, and correction have no effect.
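The selection of the final scaling factor may be sketched as follows. The sketch assumes that Equation 112 is a linear combination weighting the open loop factor by the bounded LPC attenuation divided by 0.8, and that Equation 113 applies a weight of 1.1 to the smaller of the two factors; the argument name `g_lpc_inv` is illustrative. The open loop, closed loop, and attenuation values themselves are taken as inputs.

```python
def final_scaling_factor(g_ol, g_cl, g_lpc_inv, stationary_noise):
    """Combine the open and closed loop scaling factors.

    g_lpc_inv is the LPC attenuation, assumed already bounded to at
    most 0.8, so the open loop weight never exceeds 1.
    """
    if stationary_noise:
        # Stationary background noise: weight the smaller factor.
        return 1.1 * min(g_ol, g_cl)
    weight = g_lpc_inv / 0.8
    g_scl = weight * g_ol + (1.0 - weight) * g_cl
    # Bound the result between 1.0 and 1.0 + g_lpc_inv.
    return min(max(g_scl, 1.0), 1.0 + g_lpc_inv)
```

A small attenuation value (high prediction gain) pushes the weight toward the closed loop factor, and a value near 0.8 pushes it toward the open loop factor, which is the behavior the text above attributes to strong and flat spectral envelopes respectively.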
P. Gain Quantization.

The adaptive and fixed codebook gains are jointly vector quantized with 7 bits per
subframe similar to the method of G.729. The 2-dimensional codebook is searched
exhaustively for the entry that minimizes the WMSE between the target given by
Equation 116 and the reconstructed speech signal, i.e., minimizing Equation 117,

$$E = \sum_{n=0}^{79} \left(t(n) - \left(\hat{g}_a v_a(n) * h(n) + \hat{g}_c v_c(n) * h(n)\right)\right)^2$$ (Equation 117)

where the quantized adaptive and fixed codebook gains are derived from the 7-bit
codebook. The entries of the codebook contain the adaptive codebook gain and the
correction factor for the predicted fixed codebook gain. The prediction of the fixed
codebook gain is based on a 2nd order MA prediction of the fixed codebook energy. The
relation between the correction factor $\gamma_k$ and the quantized fixed codebook gain is given
by Equation 118,

$$\hat{g}_c = \gamma_k \cdot \tilde{g}_c$$ (Equation 118)

where $\hat{g}_c$ is the quantized fixed codebook gain and $\tilde{g}_c$ is the predicted fixed codebook
gain. The predicted fixed codebook gain is based on a 2nd order MA prediction of the
fixed codebook energy, and is given by Equation 119,

$$\tilde{g}_c = 10^{\frac{1}{20}\left(\tilde{E}_k + \bar{E} - E_c\right)}$$ (Equation 119)

where $\bar{E} = 30$ dB is the mean energy,

$$E_c = 10 \log_{10}\left(\frac{1}{80} \sum_{n=0}^{79} v_c(n)^2\right),$$ (Equation 120)

and $\tilde{E}_k$ is defined by Equation 121.

$$\tilde{E}_k = \sum_{i=1}^{2} b_i \cdot (20 \log_{10} \gamma_{k-i})$$ (Equation 121)

The prediction coefficients of the MA prediction are $\{b_1, b_2\} = \{0.6, 0.3\}$.
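The moving-average gain prediction may be sketched as follows, assuming the form used in G.729-style predictors: the predicted gain is 10 raised to (predicted energy + mean energy - code vector energy) / 20, with the code vector energy taken as the mean square over the subframe. The function names and the ordering of the history tuple are assumptions of this sketch.

```python
import math

MEAN_ENERGY_DB = 30.0
MA_COEFFS = (0.6, 0.3)     # b_1, b_2


def predicted_fixed_gain(v_c, past_gammas):
    """Predicted fixed codebook gain for the current subframe.

    v_c: fixed codebook vector (must have non-zero energy).
    past_gammas: (gamma_{k-1}, gamma_{k-2}), the two previous
    correction factors, each greater than zero.
    """
    e_pred = sum(b * 20.0 * math.log10(g)
                 for b, g in zip(MA_COEFFS, past_gammas))
    e_c = 10.0 * math.log10(sum(s * s for s in v_c) / len(v_c))
    return 10.0 ** ((e_pred + MEAN_ENERGY_DB - e_c) / 20.0)


def quantized_fixed_gain(gamma, v_c, past_gammas):
    """Quantized gain: correction factor times the predicted gain."""
    return gamma * predicted_fixed_gain(v_c, past_gammas)
```

Quantizing the correction factor rather than the gain itself means the codebook only has to cover the prediction error, which is what allows both gains to be represented with 7 bits per subframe.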
11. Mode 1 Processing.

In Mode 1 the signal encoding proceeds according to the mode optimized for
"periodic"-like signals. In Mode 1 a frame is divided into three subframes. Referring to
Fig. 6, the processing within the dotted box is executed on a subframe basis with the
index k denoting the subframe number. The remaining functions (outside the dotted box)
are executed on a frame basis. Accordingly, a Mode 1 process can require buffering
parameters for three subframes at the boundary between subframe and frame based
processing, e.g., the pre-quantized pitch gains, quantized adaptive and fixed codebook
vectors, target vector, etc.

As shown in Fig. 6, a pitch track is provided to an adaptive codebook 610. In
addition, unquantized pitch gains are provided to a three dimensional (3D) open loop
vector quantization (VQ) block 675 and a code-vector, shown as $v_a$, is generated by the
adaptive codebook 610. After the code-vector $v_a$ passes through a gain stage that also
receives input from the 3D open loop VQ block 675, the amplified code vector is fed into
a synthesis filter 612. The output of the synthesis filter 612 is passed through a
perceptual weighting filter 614 and on to a first summing junction that also receives input
from a modified weighted speech. The modified weighted speech is also passed to an
analysis of energy evolution block 650 and an energy processing block 660. The energy
processing block 660 itself comprises an energy normalization block 662, and a generate
energy-correct target block 666.

A fixed codebook 620 provides a code-vector, shown as $v_c$, to a gain stage and
then to a synthesis filter 622. The output of the synthesis filter 622 is passed through a
perceptual weighting filter 624 and then to the second summing junction. The output of
the second summing junction is fed to a minimization block 621 that is used to modify
selection within the fixed codebook 620. Control information is also provided to the
minimization block 621.

In addition, two additional gain stages each provide input to a third summing
junction, and the output from the third summing junction is passed to a synthesis filter
672. The output of the synthesis filter 672 is passed through a perceptual weighting filter
674 and on to a fourth summing junction that receives input from the energy processing
block 660 through a modified target signal. A buffering block 681 shows that the
modified target signal is operable to be buffered for the three subframes before being
passed to the fourth summing junction. Control information and the code-vectors $v_a$ and
$v_c$ are used to generate the modified target signal. The output from the fourth summing
junction is received by a minimization block 671 that generates a signal that is received
by a three dimensional (3D) vector quantization (VQ) gain codebook 670. The output
from the 3D VQ gain codebook 670 is provided to the fourth gain stage.

10 A. 3D Open Loop VQ of Pitch Gains. 

The 3 pitch gains derived during the pitch pre-processing are quantized open loop 
with a 4 bits 3-dimensional vector quantizer according to Equation 122. 

fe,g^}=argmin^ 

The low bit-rate is justified by the stable behavior of the pitch gains in Mode 1. The 
1 5 quantization is performed on a frame basis prior to any subframe processing. 
B. Adaptive Codebook Contribution. 

The adaptive codebook contribution is derived from the past excitation and the pitch track specified by the pitch pre-processing. Thus, an initial search of the adaptive codebook is not required. According to the interpolated pitch track $L_p(n)$ from the pitch pre-processing, each sample value of the adaptive codebook excitation is obtained by interpolation of the past excitation using a 21st order Hamming weighted Sinc window as shown in Equation 123,

$$v_a(n) = \sum_{i=-10}^{10} w_s\big(f(L_p(n)),\, i\big) \cdot e\big(n - i(L_p(n)) + i\big)$$ (Equation 123)

where $e(n)$ is the past excitation, $i(L_p(n))$ and $f(L_p(n))$ are the integer and fractional part of the pitch lag, respectively, and $w_s(f, i)$ is the Hamming weighted Sinc window. An optimal WMSE pitch gain is estimated by Equation 124

$$g = \frac{\sum_{n=0}^{N-1} t(n) \cdot \big(v_a(n) * h(n)\big)}{\sum_{n=0}^{N-1} \big(v_a(n) * h(n)\big)^2}, \quad \text{bounded by } 0.0 \le g \le 1.2$$ (Equation 124)

to minimize the WMSE between the original and reconstructed speech signal. Note that N in Equation 124 is the variable subframe size.
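The interpolation of Equation 123 can be sketched as follows. This is a sketch under stated assumptions: the exact window is not given in the text, so `windowed_sinc` assumes a sinc shifted by the fractional lag and multiplied by a Hamming taper spanning the 21 taps, and the function names are hypothetical. The sketch also assumes the integer lag exceeds 10 samples so that every tap falls on excitation that already exists.

```python
import math

HALF_TAPS = 10  # taps i = -10..10 give the 21-tap window


def windowed_sinc(frac, i):
    """Assumed window: sinc at offset (i + frac) times a Hamming taper."""
    x = i + frac
    sinc = 1.0 if x == 0.0 else math.sin(math.pi * x) / (math.pi * x)
    hamming = 0.54 + 0.46 * math.cos(math.pi * x / (HALF_TAPS + 1))
    return sinc * hamming


def adaptive_codebook(past_excitation, pitch_track):
    """Build v_a(n) for one subframe from past excitation and lags L_p(n)."""
    buf = list(past_excitation)  # oldest sample first
    start = len(buf)
    for n, lag in enumerate(pitch_track):
        int_lag = math.floor(lag)
        frac = lag - int_lag
        centre = start + n - int_lag
        if int_lag <= HALF_TAPS or centre - HALF_TAPS < 0:
            raise ValueError("lag outside the range this sketch supports")
        sample = sum(windowed_sinc(frac, i) * buf[centre + i]
                     for i in range(-HALF_TAPS, HALF_TAPS + 1))
        buf.append(sample)  # new samples become history for later ones
    return buf[start:]
```

With an integer lag the window collapses to a single unit tap, so the output is simply the past excitation delayed by the lag; fractional lags blend the neighbouring samples.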

The unquantized pitch gain is calculated according to a weighting of the optimal pitch gain expressed in Equation 125,

$$g_a = \left( \tfrac{1}{2} R_n + \tfrac{1}{2} \right) \cdot g$$ (Equation 125)

where the normalized cross-correlation is derived by Equation 126.

$$R_n = \frac{\sum_{n=0}^{N-1} t(n) \cdot \big(v_a(n) * h(n)\big)}{\sqrt{\left( \sum_{n=0}^{N-1} t(n)^2 \right) \left( \sum_{n=0}^{N-1} \big(v_a(n) * h(n)\big)^2 \right)}}$$ (Equation 126)

This weighting de-emphasizes the pitch contribution from the adaptive codebook prior to the fixed codebook search, leaving more of the pitch information in the target signal for the fixed codebook search. Note that the gain calculations of Equations 124, 125, and 126 are similar to Equations 68, 69, and 70 of the adaptive codebook search in Mode 0.
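The gain estimation of Equations 124 through 126 can be sketched as below. The sketch assumes the optimal gain is clamped to the range 0.0 to 1.2 and that the weighting has the form (R_n/2 + 1/2); both are readings of the equations rather than confirmed values, and the function names are hypothetical.

```python
import math


def convolve(x, h, length):
    """Causal convolution of x with h, truncated to `length` samples."""
    return [sum(x[k] * h[n - k] for k in range(n + 1)
                if k < len(x) and n - k < len(h))
            for n in range(length)]


def pitch_gain(target, v_a, h):
    """Return (optimal gain, normalized correlation, weighted gain)."""
    n = len(target)
    y = convolve(v_a, h, n)                       # filtered adaptive vector
    cross = sum(t * s for t, s in zip(target, y))
    energy_y = sum(s * s for s in y)
    energy_t = sum(t * t for t in target)
    if energy_y == 0.0 or energy_t == 0.0:
        return 0.0, 0.0, 0.0
    g_opt = min(max(cross / energy_y, 0.0), 1.2)  # assumed bounds
    r_n = cross / math.sqrt(energy_t * energy_y)  # normalized correlation
    g_a = (0.5 * r_n + 0.5) * g_opt               # assumed weighting form
    return g_opt, r_n, g_a
```

When the target is poorly matched by the adaptive codebook, R_n is small and the weighted gain drops toward half the optimal gain, which leaves more of the target for the fixed codebook.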
C. Fixed Codebook Search. 

The fixed codebook excitation is represented with 13 bits per subframe in Mode 1. The codebook has three pulse sub codebooks. Preferably, the 2-pulse codebook has 4096 entries, the 3-pulse codebook has 2048 entries, and the 6-pulse codebook has 2048 entries. This number of entries sums to a total of 8192 entries that can be addressed by 13 bits. Weighting of the WMSE of the different sub codebooks is applied to favor the excitation most suitable to achieve the highest perceptual quality.

The initial target for the fixed codebook is calculated from the weighted pre-processed speech with the zero-response removed, i.e., the target for the adaptive codebook, and the optimal adaptive codebook excitation and gain according to Equation 127, which is equivalent to Equation 71 of Mode 0.

$$t'(n) = t(n) - g_a \cdot \big(v_a(n) * h(n)\big)$$ (Equation 127)

Like Mode 0, the perceptual weighting for the search of the fixed codebook is adapted according to the instantaneous Noise to Signal Ratio (NSR) by an adaptive 1st order filter, as seen in Equations 72 and 73. This results in a modified target and synthesis filter response denoted by $t''(n)$ and $h''(n)$, respectively.

Like the fixed codebook search of Mode 0, the fixed codebook search of Mode 1 builds characteristics into the excitation signal by modifying the filter response. However, the phase dispersion filter of Mode 0 is omitted and only the incorporation of the most significant correlation of the recent past is included in this mode. This procedure was described by Equations 75 through 79. Note that the fixed subframe size of Mode 0 in Equations 75 through 79 may be replaced by the variable subframe size of Mode 1. The response of the synthesis filter (like Equation 79) for the search of the pulse codebooks is defined by Equation 128.

$$h_1(n) = h_{c1}(n) * \big(h_{c2}(n) * h''(n)\big)$$ (Equation 128)

In contrast to Mode 0, Mode 1 applies the traditional forward pitch enhancement by modifying the impulse response of the synthesis filter according to Equation 129,

$$h_2(n) = h_{pe}(n) * h_1(n)$$ (Equation 129)

where the pitch enhancement filter is given by Equation 130.



(Equation 130) 



D. 2-Pulse Codebook.

The 2-pulse codebook is searched in a closed-loop to minimize the perceptually weighted mean-squared error between the original and the reconstructed speech. The MSE between the candidate excitation convolved with the weighted response of the synthesis filter expressed by Equation 129 and the target $t''(n)$ is minimized according to Equation 131,

$$c_{2P}(n) = \arg\max \left\{ \frac{\left( \sum_{n=0}^{N-1} t''(n) \cdot \big(c_{2P}(n) * h_2(n)\big) \right)^2}{\sum_{n=0}^{N-1} \big(c_{2P}(n) * h_2(n)\big)^2},\ c_{2P}(n) \in \{ c_{2P}^{l}(n),\ l = 0, \ldots, 4095 \} \right\}$$ (Equation 131)

where $c_{2P}^{l}(n)$, $l = 0, \ldots, 4095$ are the candidate excitation vectors from the 2-pulse codebook, and $c_{2P}(n)$ is the best candidate. The pitch enhancement parameters of Equation 130 are



/ =i L " L " <N 



and 



(Equation 132) 



L p <N 



pe [min{0.2.£ a ,0.2} L p >N* 



(Equation 133) 



where $L_p$ is the integer lag at the center of the subframe and N is the variable subframe size.

Each pulse is preferably restricted to a track where 5 bits specify track position, and 1 bit specifies the sign of the pulse. This is equivalent to a total of 12 bits (4096 entries). The tracks for the 2 pulses, $T_{p1}$ and $T_{p2}$, can be expressed as:

$T_{p1}$:
{0,1,2,3,4,5,6,7,8,9,10,12,14,1^

$T_{p2}$:
{1,3,5,7,9,11,12,13,14,15,16,17,18^^
9,51}

Each track preferably has 32 pulse positions that can be addressed by 5 bits. Preferably, a reduced complexity search is applied to maintain low complexity.

E. 3-Pulse Codebook. 

The 3-pulse codebook is searched in a closed-loop to minimize the perceptually weighted mean-squared error between the original and the reconstructed speech signal. The MSE between the candidate excitation convolved with the weighted response of the synthesis filter given by Equation 129 and the target $t''(n)$ is minimized according to Equation 134,

$$c_{3P}(n) = \arg\max \left\{ \frac{\left( \sum_{n=0}^{N-1} t''(n) \cdot \big(c_{3P}(n) * h_2(n)\big) \right)^2}{\sum_{n=0}^{N-1} \big(c_{3P}(n) * h_2(n)\big)^2},\ c_{3P}(n) \in \{ c_{3P}^{l}(n),\ l = 0, \ldots, 2047 \} \right\}$$ (Equation 134)

where $c_{3P}^{l}(n)$, $l = 0, \ldots, 2047$ are the candidate excitation vectors from the 3-pulse codebook, and $c_{3P}(n)$ is deemed the best candidate. The pitch enhancement parameters of Equation 130 are



$$l_{pe} = \begin{cases} L_p & L_p < N \\ \tfrac{1}{2} L_p & L_p \ge N \end{cases}$$ (Equation 135)

and

$$g_{pe} = \begin{cases} 0.5 \cdot g_a & L_p < N \\ \min\{0.2 \cdot g_a,\ 0.2\} & L_p \ge N \end{cases}$$ (Equation 136)

where $L_p$ is the integer lag at the center of the subframe and N is the variable subframe size. The 3-pulse codebook is constructed by a specification of an absolute point by 3 bits (from a track of 8 positions) and the position of each of the three pulses relative to the absolute point with either 1 or 2 bits and a 1 bit sign. This sum of bits, 3 + (2 + 2 + 1) + 3 = 11 bits, is equivalent to 2048 entries. The track for the absolute point is expressed below.

$T_{abs}$: {0,6,12,18,24,30,36,43},

and the relative tracks for the 3 pulses are shown as $\Delta T_{p1}$, $\Delta T_{p2}$, and $\Delta T_{p3}$.

$\Delta T_{p1}$: {2,4,6,8}
$\Delta T_{p2}$: {1,3,5,7}
$\Delta T_{p3}$: {0,9}

Preferably, a reduced complexity search is applied to maintain low complexity.
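The construction of the 3-pulse codebook from the absolute and relative tracks can be checked with a short enumeration. This is an illustrative sketch; the function name and the representation of an entry as (position, sign) pairs are assumptions, and the ordering of entries does not reflect the actual bit packing of the coder.

```python
from itertools import product

T_ABS = [0, 6, 12, 18, 24, 30, 36, 43]             # 3 bits
DELTA_TRACKS = [[2, 4, 6, 8], [1, 3, 5, 7], [0, 9]]  # 2 + 2 + 1 bits


def three_pulse_codebook():
    """Yield every candidate as a list of three (position, sign) pairs."""
    for abs_pos in T_ABS:
        for deltas in product(*DELTA_TRACKS):
            for signs in product((+1, -1), repeat=3):  # 3 sign bits
                yield [(abs_pos + d, s) for d, s in zip(deltas, signs)]


entries = list(three_pulse_codebook())
highest_position = max(pos for entry in entries for pos, _ in entry)
```

The enumeration yields 8 x 32 x 8 candidates, and the largest pulse position is 43 + 9 = 52, so every pulse fits in a subframe of at least 53 samples.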
F. 6-Pulse Codebook.

The 6-pulse codebook is searched in a closed-loop to minimize the perceptually weighted mean-squared error between the original and the reconstructed speech signal. The MSE between the candidate excitation convolved with the weighted response of the synthesis filter given by Equation 129 and the target $t''(n)$ is minimized according to Equation 137,

$$c_{6P}(n) = \arg\max \left\{ \frac{\left( \sum_{n=0}^{N-1} t''(n) \cdot \big(c_{6P}(n) * h_2(n)\big) \right)^2}{\sum_{n=0}^{N-1} \big(c_{6P}(n) * h_2(n)\big)^2},\ c_{6P}(n) \in \{ c_{6P}^{l}(n),\ l = 0, \ldots, 2047 \} \right\}$$ (Equation 137)

where $c_{6P}^{l}(n)$, $l = 0, \ldots, 2047$ are the candidate excitation vectors from the 6-pulse codebook, and $c_{6P}(n)$ is deemed the best candidate. The pitch enhancement of the 3-pulse codebook is preferably used, see Equations 135 and 136.

Each of the pulses is restricted to a track. The tracks of the first 5 pulses have 2 positions and the last track has 1 position. The sign of each pulse is specified with 1 bit. This is equivalent to a total of 11 bits or to 2048 entries. The tracks for 6 pulses may be given by $T_{p1}$, $T_{p2}$, $T_{p3}$, $T_{p4}$, $T_{p5}$, and $T_{p6}$.

$T_{p1}$: {0,5}

$T_{p2}$: {9,14}

$T_{p3}$: {18,23}

$T_{p4}$: {27,32}

$T_{p5}$: {36,41}

$T_{p6}$: {46}
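A closed-loop search over the 6-pulse tracks listed above can be sketched as an exhaustive loop that maximizes the ratio of Equation 137. This is a plain full search for illustration, not the reduced complexity search of the coder. The function names are hypothetical, and skipping candidates with non-positive correlation is an added assumption (the gain is taken to be non-negative).

```python
from itertools import product

TRACKS = [[0, 5], [9, 14], [18, 23], [27, 32], [36, 41], [46]]


def filtered(pulses, h, n):
    """Convolve a sparse pulse vector with h, truncated to n samples."""
    y = [0.0] * n
    for pos, sign in pulses:
        for k in range(min(len(h), n - pos)):
            y[pos + k] += sign * h[k]
    return y


def search_six_pulse(target, h):
    """Return the (position, sign) list maximizing corr^2 / energy."""
    n = len(target)
    best, best_score = None, -1.0
    for positions in product(*TRACKS):               # 2^5 position choices
        for signs in product((1, -1), repeat=6):     # 2^6 sign choices
            pulses = list(zip(positions, signs))
            y = filtered(pulses, h, n)
            corr = sum(t * s for t, s in zip(target, y))
            energy = sum(s * s for s in y)
            if corr <= 0.0 or energy == 0.0:
                continue  # assumption: only positive gains are useful
            score = corr * corr / energy
            if score > best_score:
                best, best_score = pulses, score
    return best
```

All 2048 candidates are visited, so the sketch also confirms the 11-bit size of the codebook; a practical encoder would prune this loop.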

Again, a reduced complexity search may be used to simplify the search.

G. Final Selection.

The selection of the final fixed codebook excitation preferably compares the WMSE of the best candidate from each of the three sub codebooks after applying appropriate weighting according to the classification information. The modifications of the excitation described by Equation 128 and the pitch enhancement (Equation 129) are incorporated into the excitation. The final fixed codebook excitation is denoted $\vec{v}_c$ or $v_c(n)$.

H. Energy Normalization and Correction

The Energy Normalization, Smoothing, and Correction module is similar to the Energy Normalization, Smoothing, and Correction described for Mode 0. However, it also differs: in Mode 1, energy smoothing is not performed because Mode 1 does not encounter segments of stationary background noise. Furthermore, only the fixed codebook contribution is considered since the adaptive codebook gain was previously quantized. The process involves estimating the unquantized fixed codebook gain, the open-loop and closed-loop energies, the normalized gains, and the new target signal for the gain quantization.

I. Estimation of Unquantized Gains. 

Since the adaptive codebook gain is already quantized, only the unquantized fixed codebook gain needs to be estimated. It is estimated to minimize the WMSE between the original speech signal and the reconstructed speech signal according to Equation 138,

$$g_c = \arg\min \left\{ \sum_{n=0}^{N-1} \Big( t'(n) - \big(g_c\, v_c(n) * h(n)\big) \Big)^2 \right\}$$ (Equation 138)

where

$$t'(n) = t(n) - \big(\hat{g}_a\, v_a(n) * h(n)\big).$$ (Equation 139)

The gain is given by

$$g_c = \frac{\sum_{n=0}^{N-1} t'(n) \cdot \big(v_c(n) * h(n)\big)}{\sum_{n=0}^{N-1} \big(v_c(n) * h(n)\big)^2}.$$ (Equation 140)
J. Energy Estimation

The target energy of the quantized excitation (also referred to as the open loop target energy) can be expressed by Equation 141,

$$E_e = \sum_{n=0}^{N-1} e(n)^2$$ (Equation 141)

where $e(n)$ is the residual signal. The target energy of the reconstructed speech (also referred to as the closed loop target energy) is expressed by Equation 142,

$$E_s = \sum_{n=0}^{N-1} t'(n)^2$$ (Equation 142)

where $t'(n)$ is expressed by Equation 139.
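The estimates of Equations 139 through 142 can be sketched together, since they share the same intermediate signals. The sketch assumes the fixed codebook gain is the least-squares solution of Equation 138, that is, the correlation of the updated target with the filtered fixed code-vector divided by the energy of that filtered vector. The function and argument names are hypothetical.

```python
def convolve(x, h, length):
    """Causal convolution of x with h, truncated to `length` samples."""
    return [sum(x[k] * h[n - k] for k in range(n + 1)
                if k < len(x) and n - k < len(h))
            for n in range(length)]


def gain_and_energy_targets(target, residual, v_a, v_c, h, g_a_quantized):
    """Return (fixed codebook gain, open loop energy, closed loop energy)."""
    n = len(target)
    y_a = convolve(v_a, h, n)
    y_c = convolve(v_c, h, n)
    # Remove the already-quantized adaptive contribution from the target.
    t_prime = [t - g_a_quantized * a for t, a in zip(target, y_a)]
    energy_c = sum(c * c for c in y_c)
    g_c = (sum(tp * c for tp, c in zip(t_prime, y_c)) / energy_c
           if energy_c else 0.0)
    e_open = sum(e * e for e in residual)       # energy of the residual
    e_closed = sum(tp * tp for tp in t_prime)   # energy of the new target
    return g_c, e_open, e_closed
```

The open loop target depends only on the residual, whereas the closed loop target depends on how much of the weighted speech the adaptive codebook has already explained.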
K. Energy Normalization

Based on the smoothed open and closed loop energy targets, open and closed loop scaling factors for the codebook gains are estimated to match the energy targets. The open and closed loop scaling factors may be expressed as:

$$g_{ol} = \sqrt{\frac{E_e}{\sum_{n=0}^{N-1} \big(g_c \cdot v_c(n)\big)^2}}$$ (Equation 143)

and

$$g_{cl} = \sqrt{\frac{E_s}{\sum_{n=0}^{N-1} \big(g_c\, v_c(n) * h(n)\big)^2}}$$ (Equation 144)

Based on the attenuation calculated by Equation 145,

$$g^{-1} = 0.75 \cdot [\ldots]$$ (Equation 145)

the final scaling factor $g_{scl}$ is determined through a linear combination of the open and closed loop scaling factors according to Equation 146.

$$g_{scl} = \tfrac{1}{2} \left( g^{-1} \cdot g_{ol} + (1 - g^{-1}) \cdot g_{cl} \right), \quad \text{bounded by } 1.0 < g_{scl} < (1.0 + g^{-1})$$ (Equation 146)

Like Mode 0, closed loop scaling is favored for non-flat signals and open-loop scaling is favored for flat signals.

L. Energy Correction.

If the signal to background noise ratio is below 12 dB, the unquantized fixed codebook gain is modified according to Equation 147,

$$g'_c = g_{scl} \cdot g_c$$ (Equation 147)

based on the final scaling factor. If the signal to background noise ratio is above 12 dB, the energy correction is not performed and the unquantized fixed codebook gain is not modified.

$$g'_c = g_c.$$ (Equation 148)

The target is corrected according to Equation 149,

$$t(n) = \hat{g}_a\, v_a(n) * h(n) + g'_c\, v_c(n) * h(n).$$ (Equation 149)

The correction of the target artificially increases the correlation between the target signal and the filtered excitation vectors, avoiding the typical energy fluctuations for waveform matching (CELP coders) of noise-like signals. This phenomenon is typically caused by erratic correlation between the target and the filtered excitation vectors due to low bit-rate excitation. However, as opposed to Mode 0, the adaptive codebook contribution is not affected. Consequently, only the fixed codebook contribution is affected.

M. 3D VQ of Fixed Codebook Gains.

The subframe processing of Mode 1 is performed with unquantized fixed codebook gains. The excitation signals, target signals, and quantized adaptive codebook gains are buffered during the subframe processing and used to perform delayed joint quantization of the three fixed codebook gains with an 8-bit vector quantizer.

The delayed quantization introduces an error during the subframe processing since the past excitation for the 2nd and 3rd subframe is not fully quantized, i.e., the adaptive codebook is not correct. However, the error appears to be negligible. To fully synchronize encoder and decoder as well as to correctly update the filter memories, the synthesis for all subframes is repeated with fully quantized parameters once the delayed vector quantization of the three fixed codebook gains is complete. The 3-dimensional codebook is searched to minimize E,

$$\begin{aligned} E = {} & \sum_{n=0}^{52} \Big( t^1(n) - \big( \hat{g}_a^1 v_a^1(n) * h(n) + \hat{g}_c^1 v_c^1(n) * h(n) \big) \Big)^2 \\ & + \sum_{n=0}^{52} \Big( t^2(n) - \big( \hat{g}_a^2 v_a^2(n) * h(n) + \hat{g}_c^2 v_c^2(n) * h(n) \big) \Big)^2 \\ & + \sum_{n=0}^{53} \Big( t^3(n) - \big( \hat{g}_a^3 v_a^3(n) * h(n) + \hat{g}_c^3 v_c^3(n) * h(n) \big) \Big)^2 \end{aligned}$$ (Equation 150)

where the quantized pitch gains $\{\hat{g}_a^1, \hat{g}_a^2, \hat{g}_a^3\}$ originate from the original frame based processing, and $\{t^1(n), t^2(n), t^3(n)\}$, $\{v_a^1(n), v_a^2(n), v_a^3(n)\}$, and $\{v_c^1(n), v_c^2(n), v_c^3(n)\}$ are buffered during the subframe processing.
The fixed codebook gains $\{\hat{g}_c^1, \hat{g}_c^2, \hat{g}_c^3\}$ are derived from an 8-bit codebook where the entries of the codebook contain a 3-dimensional correction factor for the predicted fixed codebook gains. The prediction of the fixed codebook gains is based on MA prediction of the fixed codebook energy. The relation between the correction factors $\gamma_k^j$ and the quantized fixed codebook gains is given by Equation 151,

$$\hat{g}_c^j = \gamma_k^j \cdot \tilde{g}_c^j$$ (Equation 151)

where $\hat{g}_c^j$ is the quantized fixed codebook gain and $\tilde{g}_c^j$ is the predicted fixed codebook gain of the j-th subframe of frame k. The predicted fixed codebook gain is based on MA prediction of the fixed codebook energy, and it is given by Equation 152,

$$\tilde{g}_c^j = 10^{\frac{1}{20} \left( \tilde{E}_k^j + \bar{E} - E_c^j \right)}$$ (Equation 152)

where $\bar{E} = 34$ dB is the mean energy, and

$$E_c^j = 10 \log_{10} \left( \frac{1}{N} \sum_{n=0}^{N-1} v_c^j(n)^2 \right),$$ (Equation 153)

and

$$\tilde{E}_k^j = \sum_{i=j}^{3} b_i \cdot \left( 20 \log_{10} \gamma_{k-1}^{3+j-i} \right).$$ (Equation 154)

The prediction coefficients of the MA prediction are $\{b_1, b_2, b_3\} = \{0.6, 0.3, 0.1\}$. The prediction of the energy from further back has greater leakage to accommodate the greater uncertainty associated with the prediction. This applies to the 2nd and 3rd subframe where the most recent history is not yet available due to the joint quantization.

An alternative and better method has been developed. This method applies 3rd order MA prediction of the energies for all 3 subframes, and thus does not use only 2nd order and 1st order MA prediction for the second and third subframe, respectively. Instead, the additional leakage for the second and third subframe is introduced by having 3 different predictors. The prediction equivalent to Equation 154 can be expressed by Equation 155,

$$\tilde{E}_k^j = \sum_{i=1}^{3} b_{j,i} \cdot \left( 20 \log_{10} \gamma_{k-1}^{4-i} \right),$$ (Equation 155)

where the predictor coefficients are

$$\{b_{j,1}, b_{j,2}, b_{j,3}\} = \begin{cases} \{0.6,\ 0.30,\ 0.100\} & j = 1 \text{ (1st subframe)} \\ \{0.4,\ 0.25,\ 0.100\} & j = 2 \text{ (2nd subframe)} \\ \{0.3,\ 0.15,\ 0.075\} & j = 3 \text{ (3rd subframe)} \end{cases}$$ (Equation 156)

Consequently, the prediction of the energies of the 3 subframes is based on the same past memory. This method provides a more stable prediction error with less fluctuation and fewer outliers, improving the accuracy of the quantization.
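The alternative prediction with three predictor sets can be sketched as follows. The coefficient table and the 34 dB mean energy come from the text. The remaining details are assumptions: the predicted energy is taken as a weighted sum of the previous frame's correction factors in dB (most recent subframe first), and the conversion from predicted energy to predicted gain is assumed to follow the form of Equations 152 and 153. The function name and argument layout are hypothetical.

```python
import math

MEAN_ENERGY_DB = 34.0
PREDICTORS = {
    1: (0.6, 0.30, 0.100),   # 1st subframe
    2: (0.4, 0.25, 0.100),   # 2nd subframe
    3: (0.3, 0.15, 0.075),   # 3rd subframe
}


def predicted_gains(prev_corrections, codevectors):
    """Predict the fixed codebook gain of each of the 3 subframes.

    prev_corrections: positive correction factors of the previous frame,
                      in subframe order (oldest first).
    codevectors:      the 3 non-zero fixed code-vectors of this frame.
    """
    # Most recent correction factor first, expressed in dB.
    history_db = [20.0 * math.log10(g) for g in reversed(prev_corrections)]
    gains = []
    for j, v_c in enumerate(codevectors, start=1):
        predicted_db = sum(b * e for b, e in zip(PREDICTORS[j], history_db))
        energy_db = 10.0 * math.log10(sum(x * x for x in v_c) / len(v_c))
        gains.append(
            10.0 ** ((predicted_db + MEAN_ENERGY_DB - energy_db) / 20.0))
    return gains
```

All three subframes read the same memory, and the smaller coefficient sums for the 2nd and 3rd subframes (0.75 and 0.525) provide the additional leakage.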

12. Decoder

A block diagram of the decoder 700 is shown in Fig. 7. Decoder 700 is based on an inverse mapping of the bit-stream to the method parameters followed by synthesis according to the mode decision. The synthesis is similar for both modes. The differentiating factor is the number of subframes and the decoding of the parameters (excitation vectors and gains) from the bit stream.

The decoder comprises an adaptive codebook 710 and a fixed codebook 720 as shown in Fig. 7. The adaptive codebook 710 is operable with both a Mode 0 711 and a Mode 1 712. Similarly, the fixed codebook 720 is operable with both a Mode 0 721 and a Mode 1 722. A code-vector is provided from the adaptive codebook 710, shown as $v_a^k$, to a first gain stage. Similarly, a code-vector is provided from the fixed codebook 720, shown as $v_c^k$, to a second gain stage. The gains by which the first gain stage and the second gain stage operate are controlled by a common block that is operable at a Mode 0 790 and a Mode 1 791. The Mode 0 block 790 contains a two dimensional (2D) vector quantization (VQ) gain codebook 792 that is operable to provide both adaptive and fixed gain control. The Mode 1 block 791 contains a three dimensional (3D) vector quantization (VQ) gain codebook 793a and a three dimensional (3D) vector quantization (VQ) gain codebook 793c. The 3D VQ gain codebook 793a is operable to provide the gains for the adaptive codebook, and the 3D VQ gain codebook 793c is operable to provide the gains for the fixed codebook.

The outputs from the first gain stage and the second gain stage are each fed to a summing junction, and the output from the summing junction is received by a synthesis filter 772. The output from the synthesis filter 772 is received by a post-processing block 774 from which a reconstructed speech signal is provided. The operation of all blocks of the decoder has been described above except for the frame erasure concealment and the post processing.

The decoder initially checks for frame erasure by checking the synchronization word. If the frame is declared "good" the regular decoding proceeds (as given by the encoder). If the frame is declared "bad" the erasure concealment is activated. It is performed on the parameter level similar to G.729.

13. Frame Erasure Concealment

The frame erasure concealment is performed on a parameter level. This involves predicting the Mode, the LPC synthesis filter, the pitch track, the fixed codebook excitation, and the adaptive and fixed codebook gains, and synthesizing the speech of the erased frame from the predicted parameters.

A. Mode.

The mode is predicted as the previous mode. This is based on the observation that adjacent frames often are in the same mode.

B. LPC synthesis filter

The prediction of the LPC filter is based on the LSFs. The LSFs are estimated as the previous LSFs shifted slightly towards the mean, i.e.,

$$\mathrm{lsf}_n = 0.9 \cdot \mathrm{lsf}_{n-1} + 0.1 \cdot \langle \mathrm{lsf} \rangle,$$ (Equation 157)

where $\langle \mathrm{lsf} \rangle$ is the mean (fixed) of the LSFs, and $\mathrm{lsf}_{n-1}$ and $\mathrm{lsf}_n$ are the reconstructed LSFs of the past and current frame, respectively. The memory for the MA predictor is updated with a weighted average of the past 4 updates according to Equation 158.

During the frames following a frame erasure the gain of the LPC filter is closely monitored in order to detect abnormal behavior.
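The LSF estimate of Equation 157 can be sketched in a few lines. The function name and the numeric LSF values in the example are hypothetical; only the 0.9 and 0.1 weights come from the text.

```python
def conceal_lsfs(previous_lsfs, mean_lsfs, pull=0.1):
    """Shift the previous LSFs a fraction `pull` of the way to the mean."""
    return [(1.0 - pull) * p + pull * m
            for p, m in zip(previous_lsfs, mean_lsfs)]


# Hypothetical 2-element example; real LSF vectors are longer.
lsfs = [0.20, 0.40]
mean = [0.30, 0.30]
first_erasure = conceal_lsfs(lsfs, mean)
second_erasure = conceal_lsfs(first_erasure, mean)
```

Applied over consecutive erased frames, the estimate moves steadily toward the fixed mean, so the synthesis filter relaxes to an average spectral shape.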

C. Pitch track

For Mode 0 the pitch lag of the first subframe, $L_n(0)$, is set to the pitch lag of the second subframe of the previous frame, $L_{n-1}(1)$, and for the second subframe the pitch lag, $L_n(1)$, is incremented by 1, i.e.,

$$L_n(0) = L_{n-1}(1),$$ (Equation 159)

$$L_n(1) = L_n(0) + 1.$$ (Equation 160)

For Mode 1 the pitch track interpolation is based on the previous pitch lags used for the pitch interpolation according to

[...] (Equation 161)

[...] (Equation 162)

Based on the pitch lag information the adaptive codebook contribution is derived as in the encoder.

[...] (Equation 158)

D. Fixed Codebook Excitation.

For Mode 0 a random entry from the Gaussian sub codebook is selected. The signs and the entries for the two basis vectors are generated with a pseudo random number generator. For Mode 1 a random entry from the 6-pulse codebook is selected. Alternatively, a randomly selected Gaussian excitation vector (as for Mode 0) could be used.

E. Adaptive and Fixed Codebook Gains. 

For Mode 0 the adaptive and fixed codebook gains are gradually decreased according to Equations 163 and 164,

$$\hat{g}_{a,n} = \alpha_{a,i} \cdot \hat{g}_{a,n-1}$$ (Equation 163)

and

[...] (Equation 164)

where the scaling factors are given by Equations 165 and 166.

[...] (Equation 165)

and

$$\alpha_{c,i} = \begin{cases} 0.98 & i \le 3 \\ 0.80 & i = 4 \\ 0.30 & i = 5 \\ 0.20 & i \ge 6 \end{cases}$$ (Equation 166)

The index i in Equations 163, 164, 165, and 166 specifies the number of consecutive frame erasures, and the index n in Equations 163 and 164 represents the running subframe number.

The energy of the complete fixed codebook excitation is scaled down (as opposed to simply scaling the fixed codebook gain) in order to better control the energy evolution or change during frame erasures. For Mode 1 the adaptive and fixed codebook gains are quantized separately. The estimation of the adaptive codebook gain is given by Equation 167,

$$\hat{g}_{a,n} = \alpha_{a,i} \cdot \max\{ [\ldots],\ \hat{g}_{a,n-1} \}$$ (Equation 167)

where the scaling factor is

$$\alpha_{a,i} = \begin{cases} 0.98 & i = 1 \\ 0.96 & 1 < i \le 6 \\ 0.70 & i > 6 \end{cases}$$ (Equation 168)

The fixed codebook gain is preferably estimated according to Equations 164 and 166, except that the down scaling for the first subframe of the first frame erasure is 1.0 as opposed to 0.98. The index n in Equation 167 represents the running subframe number.

14. Post Processing

The post processing is similar to the main body of G.729 except that the short-term post filter given by the following equation:

$$H_{st}(z) = \frac{\hat{A}(z / \gamma_{1,n})}{\hat{A}(z / \gamma_2)}$$ (Equation 169)

is adapted according to the background noise level. The adaptation is introduced by making $\gamma_{1,n}$ a function of the signal to background noise ratio $\zeta_{dec}$ (estimated at the decoder) according to Equation 170.

$$\gamma_{1,n} = \begin{cases} \gamma_{1,n-1} - 0.025, \text{ bounded by } \gamma_{1,n} \ge 0.57, & \zeta_{dec} > 12\ \text{dB} \\ \gamma_{1,n-1} + 0.025, \text{ bounded by } \gamma_{1,n} \le 0.75, & \zeta_{dec} \le 12\ \text{dB} \end{cases}$$ (Equation 170)

Consequently, during noisy segments the short-term post filter converges towards a flat filter (disabling the short-term post filter), and during "clean" segments it converges towards the short-term post filter of G.729. The weighting factor $\gamma_{1,n}$ is updated on a subframe basis.
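The subframe-by-subframe adaptation of the weighting factor in Equation 170 can be sketched as a clamped step. The step size, the bounds 0.57 and 0.75, and the 12 dB threshold come from the text; the function name is hypothetical, and treating a ratio of exactly 12 dB as the noisy case is an assumption.

```python
def update_weighting_factor(previous, snr_db, step=0.025,
                            low=0.57, high=0.75, threshold_db=12.0):
    """Return the weighting factor for the next subframe."""
    if snr_db > threshold_db:
        # Clean speech: step down toward the lower bound.
        return max(previous - step, low)
    # Noisy speech: step up toward the upper bound (flatter filter).
    return min(previous + step, high)


factor = 0.57
for _ in range(10):              # ten noisy subframes
    factor = update_weighting_factor(factor, snr_db=5.0)
noisy_limit = factor

for _ in range(10):              # ten clean subframes
    factor = update_weighting_factor(factor, snr_db=30.0)
clean_limit = factor
```

Because each update moves by only 0.025, the filter needs about eight subframes to travel between its two extremes, which avoids abrupt changes in the post-filtered speech.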

While various embodiments of the invention have been described, it will be apparent to those of ordinary skill in the art that many more embodiments and implementations are possible that are within the scope of this invention. Accordingly, the invention is not to be restricted except in light of the attached claims and their equivalents.

What is claimed is:

1. An extended code excited linear prediction speech coding method, the method comprising:
analyzing an input signal according to a predetermined plurality of features of the input signal;
initially performing common frame based speech coding to the input speech signal irrespective of the predetermined plurality of features of the input signal;
subsequently performing speech coding on the input speech signal to generate an encoded speech signal using at least one of a first speech coding mode and a second speech coding mode;
dynamically selecting between the first speech coding mode and the second speech coding mode based on at least one of the predetermined plurality of features of the input signal;
the first speech coding mode comprises a first framing structure;
the second speech coding mode comprises a second framing structure; and
the first framing structure and the second framing structure comprise a plurality of bits.

Abstract

The invention improves the encoding and decoding of speech by focusing the encoding on the perceptually important characteristics of speech. The system analyzes selected features of an input speech signal, and first performs a common frame based speech coding of the input speech signal. The system then performs a speech coding based on either a first speech coding mode or a second speech coding mode. The selection of a mode is based on characteristics of the input speech signal. The first speech coding mode uses a first framing structure and the second speech coding mode uses a second framing structure.


